<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Avalanche: Putting the Spirit of the Web back into Semantic Web Querying</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cosmin Basca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DDIS, Department of Informatics, University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <fpage>64</fpage>
      <lpage>79</lpage>
      <abstract>
        <p>Traditionally Semantic Web applications either included a web crawler or relied on external services to gain access to the Web of Data. Recent efforts have enabled applications to query the entire Semantic Web for up-to-date results. Such approaches are based on either centralized indexing of semantically annotated metadata or link traversal and URI dereferencing as in the case of Linked Open Data. By making limiting assumptions about the information space, they violate the openness principle of the Web - a key factor for its ongoing success. In this article we propose a technique called Avalanche, designed to allow a data surfer to query the Semantic Web transparently without making any prior assumptions about the distribution of the data - thus adhering to the openness criteria. Specifically, Avalanche can perform “live” (SPARQL) queries over the Web of Data. First, it gets on-line statistical information about the data distribution, as well as bandwidth availability. Then, it plans and executes the query in a distributed manner trying to quickly provide first answers. The main contribution of this paper is the presentation of this open and distributed SPARQL querying approach. Furthermore, we propose to extend the query planning algorithm with qualitative statistical information. We empirically evaluate Avalanche using a realistic dataset, show its strengths but also point out the challenges that still exist.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>With the introduction of the World Wide Web, the way we share knowledge and
conduct day to day activities has changed fundamentally. With the advent of
the Semantic Web, a Web of Data is emerging interlinking ever more machine
readable data fragments represented as RDF documents or queryable
semantic endpoints. It is in this ecosystem that unexplored avenues for application
development are emerging.</p>
      <p>
        While some application designs include a Semantic Web (SW) data crawler,
others rely on services that facilitate access to the Web of Data (WoD) either
through the SPARQL protocol or various APIs (i.e. Sindice or Swoogle). As the
mass of data continues to grow – currently Linked Open Data (LOD) [
        <xref ref-type="bibr" rid="ref1 ref21">1</xref>
        ]
accounts for 4.7 billion triples – the scalability factor combined with the Web’s
uncontrollable nature and its heterogeneity will give raise to a new set of
challenges.
      </p>
      <p>
        A question marginally addressed today is: How to query the Web of Data
ondemand, without hindering the flexible openness principle of the Web – seen as
the ability to query independent un-cooperative semantic databases, not
controlling their distribution, their availability or having to adhere to fixed publishing
guidelines (i.e. LOD). The underlying assumptions of WoD, as with the WWW,
are that (a) there exists no distribution pattern of information onto servers,
(b) there is no guarantee of a working network, (c) there is no centralized
resource discovery system, (d) there exists a standard (HTTP) for the retrieval of
information, and (e) the size of RDF data no longer allows us to consider
singlemachine systems feasible. With the serendipitous nature of Semantic Web [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
querying the global information space gives rise to new possibilities unthought
of before.
      </p>
      <p>Several approaches that tackle the problem of querying the entire Web of
Data have emerged lately. One solution provides a centralized queryable
endpoint for the Semantic Web that caches all data. This approach allows searching
for and joining potentially distributed data sources. It does, however, incur the
significant problem of ensuring an up-to-date cache and might face crucial
scalability hurdles in the future, as the Semantic Web continues to grow.</p>
      <p>Other approaches use the guidelines of LOD publishing to traverse the linked
data cloud in search of the answer. Obviously, such a method produces up-to-date
results and can detect data locations only from the URIs of bounded entities in
the query. Relying on URI structure, however, may cause significant scalability
issues when retrieving distributed data sets, since (1) the servers dereferenced in
the URI may become overloaded, and (2) it limits the possibilities of rearranging
(or moving) the data around by binding the id (i.e. URI) to storage location.</p>
      <p>Finally, traditional database federation techniques have been applied to query
WoD. They rely on statistical information from queryable endpoints that are
used by a mediator to build efficient query execution plans. Their main drawback
is that some query execution engine is aware of the data distribution ex-ante (i.e.,
before the query execution). Moreover, in most cases, data sources even need
to register themselves at the query execution engine with detailed information
about the data they contain.</p>
      <p>In this paper, we propose Avalanche, a novel approach for querying the
Web of Data that (1) makes no assumptions about data distribution,
availability, or partitioning, (2) provides up-to-date results, and (3) is flexible as it
makes no assumption about the structure of participating triple stores.
Consequently, it addresses the shortcomings of previous approaches. To query WoD
Avalanche provides a novel technique via means of a two-phase protocol: a
discovery step, i.e. gathering statistical information about data distribution from
involved hosts, and a planning optimization step over the distributed SPARQL
endpoints. Hence, the main contributions of our approach are:
– on-demand transparent querying over the Web of Data, without any prior
knowledge about its distribution
– a formal description of our approach, together with possible optimizations
for each step
– a novel planning strategy and cost model for dealing with towards Web scale
graph data
– a reference implementation of the Avalanche technique</p>
      <p>In the remainder we first review the relevant related work of the current
stateof-the-art. Section 3 provides a detailed description of Avalanche. In Section
4 we evaluate several planning strategies and estimate the performance of our
system. In Section 5 we present several future directions and optimizations, and
conclude in Section 6.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Several solutions for querying the Web of Data over distributed SPARQL
endpoints have been proposed before. They can be grouped into two streams: (a)
distributed query processing and (b) statistical information gathering over RDF
sources.</p>
      <p>
        Research on distributed query processing has a long history in the database
field [
        <xref ref-type="bibr" rid="ref18 ref9">18, 9</xref>
        ]. Its traditional concepts are adapted in current approaches to
provide integrated access to RDF sources distributed over the Web. For instance,
Yars2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is an end-to-end semantic search engine that uses a graph model to
interactively answer queries over structured and interlinked data, collected from
disparate Web sources. Another example is the DARQ engine [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which divides
a SPARQL query into several subqueries, forwards them to multiple, distributed
query services and, finally, integrates the results of the subqueries. Finally,
Rdfpeers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a distributed RDF repository that stores three copies of each triple in
a peer-to-peer network, by applying global hash functions to its subject,
predicate and object. Virtuoso [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a data integration software developed by OpenLink
Software, is also focused on distributed query processing. The drawback of these
solutions is, however, that they assume total control over the data distributions—
an unrealistic assumption in the open Web. Similarly, SemWIQ [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] provides a
mediator that distributes the execution of SPARQL queries transparently. Its
main focus is to provide an integration and sharing system for scientific data.
Whilst it does assume control over the instance distribution they assume perfect
knowledge about it. Addressing this drawback some [
        <xref ref-type="bibr" rid="ref17 ref20">20, 17</xref>
        ] propose to extend
SPARQL with explicit instructions where to execute certain sub-queries.
Unfortunately, this assumes an ex-ante knowledge of the data distribution on part of
the query writer. Finally, Hartig et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] describe an approach for executing
SPARQL queries over spaces structured according to the Web of Linked Data
rules [
        <xref ref-type="bibr" rid="ref1 ref21">1</xref>
        ]. Whilst they make no assumptions about the openness of the data space
the LOD rules requires them to place the data on the URI-referenced servers—a
limiting assumption for example when caching/copying data.
      </p>
      <p>
        Research on query optimization for SPARQL includes query rewriting [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
triple pattern optimization based on selectivity estimations [
        <xref ref-type="bibr" rid="ref13 ref14 ref19">13, 19, 14</xref>
        ], and on
other statistical information gathering over RDF sources [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ]. RDFStats [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
is an extensible RDF statistics generator that records how often RDF properties
are used and feeds automatically generated histograms to SemWIQ. Histograms
on the combined values of SPO triples have proved to be especially useful to
provide selectivity estimations for filters [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. For joins, however, histograms can
grow very large and are rarely used in practice. Another approach is to compute
ahead frequent paths (i.e., frequently occurring sequences of S, P or O) in the
RDF data graph and keep statistics about the most beneficial ones [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It is
unclear how this would work in a highly distributed scenario. Finally, Neumann
et. al [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] claim that selectivity estimation is a worthwhile solution for tens of
millions of RDF triples, but unsuitable for billions of triples, because the size of
the data and the increasing diversity in property names lead to poor estimations,
thus misguiding the query optimizer.
3
      </p>
      <p>Avalanche — System Design and Implementation
In this section, we describe the overall design of Avalanche and the underlying
philosophy of the distributed query execution across large datasets spread over
multiple, uncooperative servers.</p>
      <p>The major design difference between Avalanche and previous systems is that
it assumes that the distribution of triples to machines participating in the query
evaluation is unknown prior to query execution. Hence, our approach follows
neither a federated nor a peer-to-peer model, instead the statistical discovery
phase that traditionally is reserved for the (parallel) mediator component in
clustered approaches, has become an individual step during each query execution
phase. In the remaining part of this section, we will first illustrate our approach
using a motivating example. This will lead the way towards thoroughly describing
the Avalanche components and its novelty.</p>
      <p>The system consists of six major components working together in a
parallelized pipeline: the Avalanche endpoints Web Directory or Search Engine, the
Statistics Requester, the Plan Generator, Plan Executor instances, Plan
Materializer instances and the Query Stopper component as seen in Figure 1.</p>
      <p>Avalanche comprises of two phases: the Query Preprocessing phase and
the parallel Query Execution phase. During Query Preprocessing,
participating hosts are identified via means of a Search Engine such as Sindice1 or Web
Directory. A lightweight endpoint-schema inverted index can also be used.
Ontological prefix (the shorthand notation of the schema – i.e. foaf) and schema
invariants (i.e. predicates, classes, labels, etc) are appropriate candidate entries
to index. After query parsing, this information is immediately available and
used to quickly trim down the number of potential endpoints. Then, all selected
Avalanche endpoints are queried for the cardinality (number of instances) of
each unbounded variable — statistical information that triple-stores generally
posses.</p>
      <p>In the Query Execution phase, first the query is broken down into the
superset of all molecules, where a molecule is a subgraph of the overall query graph.</p>
      <sec id="sec-2-1">
        <title>1 http://sindice.com/</title>
        <p>Statistics
Requester</p>
        <p>AVALANCHE endpoints Web Directory or</p>
        <p>Search Engine (i.e. http://sindice.com)</p>
        <p>Plan
Generator</p>
        <p>Plans
Queue</p>
        <p>Plan Executor
Plan Executor</p>
        <p>...</p>
        <p>Plan Executor</p>
        <p>Finished
Plans
Queue</p>
        <p>Plan Materializer
Plan Materializer</p>
        <p>...</p>
        <p>Plan Materializer
A combination of minimally overlapping molecules covering the directed query
graph is referred to as a solution. Binding all molecules in a given solution
to physical hosts (Avalanche endpoints) that may resolve them, transforms a
solution into a plan. Given the size of the Web and the unknown distribution
of the RDF data, Avalanche will try to optimize the execution of the query to
quickly find the first K results. The proposed planning system and algorithm,
though complete, will normally not be allowed to exhaust the entire search space
since the cost of doing so is prohibitively expensive. Instead, the planner
component strives to execute the most “promising” query plans first, while being
monitored by the Query Stopper for termination conditions. To further reduce
the size of the search space, a windowed version of the search algorithm can
be employed – i.e. with each exploratory step only the first M molecules are
considered, thus sacrificing completeness.</p>
        <p>As shown in Figure 1 the Plan Generator relies on statistics about the data
contained on the different hosts from the Statistics Requester. Any
generated plan gets put in the Plans Queue regardless if the planner finished its
overall tasks of exploring the plan space or not. Plans in the Plans Queue are
fetched by Plan Executors that execute them generating partial results in
parallel and put them in the Finished Plans Queue. There, they get fetched by one
of the parallel executing Plan Materializers, who will merge and materialize
the partial results.</p>
        <p>To put Avalanche into perspective consider the following motivating query
that executes over Linked Open Datasets describing movies and actors:
SELECT ?title ?photoCollection ?name WHERE {
?film dc:title ?title; movie:actor ?actor; owl:sameAs ?sameFilm.
?actor a foaf:Person; movie:actor_name ?name .
?sameFilm dbpedia:hasPhotoCollection ?photoCollection.
?sameFilm dbpedia:studio ‘‘Producers Circle’’; }
The goal of Avalanche is to return the list of all movie titles, their photo
collections and the names of starring actors, that have been produced at “Producers
Circle” studios – considering that the required information is spread with an
unknown distribution over several LOD endpoints.</p>
        <p>At a given moment during the execution of a plan, a Plan Executor instance
may find itself in the state depicted in Figure 2 (in depth description in Section
3.2). The plan is comprised of three molecules: M1, M2, M3 and three hosts are
involved: host A, host B and host C. Molecule M1 was reported to be highly
selective on host A (holding Linked Movie2 data), while the remainder of the
plan: molecule M2 and M3, is distributed between hosts B and C (both holding
DBPEDIA3 data). Given that we operate in an environment where bandwidth
cost is non-trivial we should not “just” transport all partial results to one central
server to be joined. Instead we start with executing the highly selective (or in
this case: with the lowest cardinality) molecule M1 on host A and then limit the
execution space on host B by sending over the results from host A. The process
repeats itself given the number of molecules in the plan and is finalized with a
merge/update operation in reverse join order.</p>
        <p>Molecule 1 - M1
?sameFilm
dbpedia:hasPhotoCollection
?photoCollection.
?sameFilm dbpedia:studio
ʻʻProducers Circleʼʼ.</p>
        <p>1) Join (M1, M2)
host A
2) R1 = Execute (M1)
12) R1 = Filter (R1, R2)</p>
        <p>3) Send(R1)
Linked Movie Database endpoint
?sameFilm</p>
        <p>?actor</p>
        <p>Molecule 2 - M2
?film dc:title ?title.
?film movie:actor ?actor.
?film owl:sameAs ?sameFilm.</p>
        <p>Molecule 3 - M3
?actor a foaf:Person.</p>
        <p>?actor movie:actor_name ?name.
5) Update (M2, M1)
5) Join (M2, M3)
8) Update (M3, M2)
11) Send(R2)</p>
        <p>9) Send(FR3)
host B
4) FR2 = ExecuteFilter (R1)
10) R2 = Filter (FR2)
6) Send(FR2)</p>
        <p>host C
7) FR3 = ExecuteFilter (R3)</p>
        <p>DBPEDIA endpoints</p>
        <p>It is important to note that to execute plans, hosts will need to share a
common id space – a given in Semantic Web via URIs. Naturally, using RDF
strings can be prohibitively expensive. To limit bandwidth requirements, we
chose to employ a single global id space in the form of the SHA family of hash
functions on the URIs.</p>
        <p>The remainder of this section will detail the functionality of the most
important elements: the Plan Generator, Plan Executor and Plan Materializer
as well as explain how the overall pipeline stops.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2 http://www.linkedmdb.org/ 3 http://dbpedia.org/About</title>
        <p>3.1</p>
        <sec id="sec-2-2-1">
          <title>Generating Query Plans</title>
          <p>The planner’s main focus is to generate query plans that are likely to produce
results fast with a minimum of cost. As shown in Algorithm 1 the planner will
try to optimize the construction of plans using a multi-path informed (best-first)
search strategy by maximizing the Objective function of a plan. Therefore, all
plans are generated in descending order of their objective function.</p>
          <p>Algorithm 1 The plan generator algorithm
Plan-Generator(M olecules, Hosts, Cardinalities)
1 f ringe = []
2 for each molecule M ∈ M olecules, host H ∈ Hosts
3 partialP lan = {M, H, N U LL, Cardinalities}
4 append(f ringe, partialP lan)
5 sort(f ringe)
6 while !f ringe.empty() // Loop through f ringe
7 best = GetFirstElementWithPositiveObjective(f ringe)
8 if PlanIsComplete(best) // all molecules assigned to host
9 sort(f ringe)
10 yield GetPlan(best) // returns results but continues planning
11 else // plan is incomplete
12 remM ol = GetRemainingConnectedMolecules(M olecules, best)
13 planF ringe = []
14 for each molecule M ∈ remM ol, host H ∈ Hosts
15 partialP lan = {M, H, best, Cardinalities}
16 append(planF ringe, partialP lan)
17 sort(planF ringe)
18 concatenate(f ringe, planF ringe)</p>
          <p>In defining the Objective function we use the statistical information
gathered beforehand (result set cardinality). To ensure the generation of most
productive plans, our function models the chance of finding a solution, utility U ,
divided by the cost of executing the query, C. Hence:</p>
          <p>Objective =</p>
          <p>U
C
(1)</p>
          <p>An emergent challenge from preserving the openness of the query process
and the flexibility of semantic data publishing, is denoted by the exponential
complexity class of the plan composition space. Thus the space complexity of
the problem is O(N 3), considering that the problem size increases by M ∗ H
with each step towards a complete plan, where H represents the total number of
hosts involved and M is a measure of the query complexity (i.e. the number of
unique molecules that can be extracted from the given query graph). A simple
calculation for the scenario where 1000 hosts are involved and a rather large
query (≈ 15 unbounded variables) might generate 500 molecules with the average
depth of a plan of 10 (molecules), results in 5 million possible combinations to
form plans. Not all combinations produce viable plans, so pruning low or no
utility plans early is desired as seen in line 7 of the planning algorithm.</p>
          <p>We follow the assumption that selective molecules – with low cardinalities –
will help the plan to converge faster. In the bootstrap phase the utility of the first
plan node is equal to the inverse of its cardinality: CN TN1 (where N 1 is node
1 and CN T is the cardinality) factored by the size of the plan (Edges(N 1)).
Further on, we consider a join where the best-case cardinality is the minimum
of the involved result set cardinalities (see Equation 2). We define the cost, C
for executing queries in Equation 3. The cost of the first node is assumed to be
constant. For all other nodes we combine:
– the network latency L (between two nodes)
– a measure of the time required to send the results from node N 1 to node</p>
          <p>N 2 given the bandwidth B
– the cost of executing on N 1 and N 2 as approximated by their cardinalities
Finally, we scale this result with a measure of the current molecule size (molecule
assigned to N 2) relative to the size of the whole solution, in order to encourage
the choice of nodes that aid convergence.</p>
          <p>UN1,N2 =</p>
          <p>ECdgNesT(NN11) ,
min(CN TN1, CN TN2), otherwise
first node
CN1,N2 =
1, first node
(L + CNTN1 + CN TN1 + CCNNTTNN21 ) Edges(Solution) , otherwise</p>
          <p>
            B Edges(N2)
Extended Utility Function The main drawback of this utility function is
that it assumes the lower cardinality of the two nodes is representative—an
assumption that is quite wrong when searching for “rare” results given a large
number of “promising” hosts. Therefore it disregards the actual join
probabilities. Consider the previous example query that goes out to two almost disjoint
RDF servers: one with DBPEDIA data and another with public social network
data. Assuming we found an actor through some other host, the utility function
will not be able to favor DBPEDIA over the other host, as it cannot evaluate
the actual number of joins. Hence, if the public social network host happens to
be using a better network connection, the planer will be lead astray. To
overcome this effect we need a measure of join-quality. Following [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] we employ
bloom-filters, which are space-efficient set representation bit vectors composed
of multiple hash functions. As stated by [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] bloom-filters allow for a statistically
solid estimation of the cardinality of the join between two sets:
where BF is a bloom filter, m is the number of bits in the bloom filter, k
represents the number of hash functions, Zi represents the number of zero bits
in BFi, and Z12 represents the number of zero bits in the magnitude of their
inner product.
          </p>
          <p>Since computing bloom filters for large sets is a costly operation, we
propose the use of bloom filters as an extension to the previously proposed utility
function only for highly selectivity molecules — where the cardinality is below a
manually set threshold. Given implementation specific, execution considerations
we empirically set the threshold to 1000 partial results (ids) for the given set.
Consequently the extended utility EU is now defined as follows:</p>
          <p>EUN1,N2 =
w1 · J OINBF (N1),BF (N2) + w2 · UN1,N2,
w2 · UN1,N2,</p>
          <p>N 1, N 2 selective
otherwise
(5)
where w1 and w2 are weights that define the importance of the employed
estimation methods. We chose w1 = 0.8 and w2 = 0.2 for our experiments, which
means that for selective molecules we favor a more expensive, but more realistic
estimation.</p>
          <p>Algorithm 2 The plan execution algorithm
Execute-Plan(P lan)
Algorithm 3 Materialing a resolved plan
Merge-Materialize(P lan, Solution, Query)
1 graph = getGraph(Solution) // the molecule graph
2 resultV ariables = getProjections(Query) // the result variables
3 resolved = [] // the bound result variables
4 results = [][] // the final table of results
5 while !resultV ariables.empty()
6 v1 = pop(resultV ariables) // get next unbound result variable
7 if resolved.empty() // no currently bound result variables
8 v2 = getNearestResultVariable(v1, resultV ariables, graph)
9 remove(projections, v2); push(resolved, v2)
10 else // there are currently bound result variable
11 v2 = getNearestResultVariable(v1, resolved, graph)
12 push(resolved, v2)
13 resultsT able = getMergeTable(v1, v2, graph) // merge partial results (id’s only)
14 if results.empty()
15 results = resultsT able
16 else
17 results = extend(results, resultsT able)
18 removeDuplicates(results)
19 materialize(results) // turn id’s into actual strings
20 return results
3.2</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Executing Plans</title>
          <p>Specifically, following Algorithm 2 we start by executing the most selective
molecule in the plan (steps 1 and 2 in Figure 2). To perform the join (lines
10-12 in Algorithm 2) we send the results to host B and execute the join there
(steps 3 and 4 in Figure 2). Similarly we join the remainder of the molecules.
After all join operations have ended, we need to let hosts A and B know of all
the elements that did not have a join-partner by updating its structure (lines
18-20 in Algorithm 2; steps 8 to 12 in Figure 2).</p>
          <p>To increase execution performance, since many plans contain overlapping
subqueries, we employ a memoization strategy by keeping partial results on
the respective hosts for the duration of the query execution, while at the same
time database caching strategies are in effect. As a further improvement,
sitelevel memory caches can be employed, bypassing the database altogether for
“popular” result sets.
3.3</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>Materializing Plans</title>
          <p>Once a plan has finished its execution, the Plan Executor monitoring the
process will signal the Avalanche mediator by pushing the executed plan onto the
Finished Plans queue. Note that the executed plans do not contain the results
yet, since the matches are kept as partial tables on their respective hosts. Hence,
plans in the Finished Plans Queue will be handled by a Plan Materializer
that materializes the partial results as described in Algorithm 3. First, we get
an unbound result variable v1 (line 6). We then try to find the next possible
result variable that will produce the lowest number of merge operations
(procedure getNearestResultVariable in lines 8 or 11). Having chosen the next
result variable we create a partial result table (line 13) and merge it with the
global result table (lines 14-17). We finish by removing duplicates and replacing
all ids with the actual strings (lines 18 and 19). To further reduce the overhead
of sending the results between hosts, we use RLE compression.
3.4</p>
        </sec>
        <sec id="sec-2-2-4">
          <title>Stopping the query execution</title>
          <p>Since we have no control over distribution and availability of RDF data and
SPARQL endpoints, providing a complete answer to the query is an unreasonable
assumption. Instead, the Query Stopper monitors for the following stopping
conditions:
– a global timeout set for the whole query execution
– returning the first K unique results to the caller
– to avoid waiting for the timeout when the number of results is K we
measure relative result-saturation. Specifically, we employ a sliding window
to keep track of the last n received result sets. If the standard deviation
(σ) of these sets falls below a given threshold then we stop execution. Using
Chebyshev’s inequality we stopped when 1 − σ12 &gt; 0.9.
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Preliminary Evaluation</title>
      <p>In this section we describe the experimental evaluation of the Avalanche
system. We first succinctly introduce the setup and then discuss the two evaluated
properties: the query execution and plan convergence.
4.1</p>
      <sec id="sec-3-1">
        <title>Experimental setup</title>
        <p>We tested Avalanche using a five-node cluster. Each machine had 2GB RAM
and an Intel Core 2 Duo E8500 @ 3.16GHz. We chose this small number of nodes
to better illustrate Avalanche’s query handling strategies, but did not measure
its ability to scale.</p>
        <p>The data was gathered directly from the LOD cloud. Specifically, we
employed the IEEE (66K triples), DBLP (22 millions) and ACM (13 millions)
publication data. The datasets were distributed over a five-node cluster, split by
their origin and chronological order (i.e. ACM articles till 2003 on host A) as
shown in Table 4.1. Recall that as stated above Avalanche makes no
assumptions about the data distribution over the nodes.</p>
        <p>For the purpose of evaluating Avalanche we selected 3 SPARQL queries as
listed in Appendix A. The queries were chosen in increasing order of complexity
(in terms of number of unbound variables and triple patterns). We conducted
all query executions with the following parameters: 1) timeout set to 20 seconds,
2) a stop sliding window of size 5, 3) a saturation threshold of 0.9, and 4) a
selectivity threshold for bloom filter construction of 1000 while searching for a
maximum of 200 results.
Query execution Figure 3 graphs the number of query results (left) and the
execution time (right) for both the default utility U and the extended utility EU
introduced in Section 3. Note that the query execution time for the extended
utility is somewhat higher (lower than the timeout), but it does find more
answers to the queries. The time used for the extended utility is higher since gives
the better plans a higher priority and executes them earlier. The execution of
“useful” plans does take longer, since a non-useful plan is stopped as soon as an
empty join is discovered. Hence, the saturation condition will stop the default
utility earlier after having executed fewer useful plans. Given a large number of
hosts we expect that the overhead of cancelling non-useful plans will overcome
the cost of executing useful plans. Hence, the extended utility planer should
converge faster.</p>
        <p>As we see in this experiment, Avalanche is able to successfully execute
query plans and retrieves many up-to-date results without having any prior
knowledge of the data distribution. We, furthermore, see that different objective
functions have a significant influence on the outcome and should play a critical
role when deployed on the Semantic Web.</p>
        <p>Planner convergence A second issue we planned to investigate is the
usefulness of the convergence criteria introduced in Section 3.4. Figure 4 graphs
the total number of results against the number of new results where the data
points represent newly arriving—possibly empty—answer sets whilst disabling
the stopping condition.</p>
        <p>As an example consider query Q1. At first, the number of new results grows
to a certain level. But, after having gathered ≈ 140 results, no more new results
are received. A similar behavior can be seen for each of the three queries. Hence,
given the experimental results the choice of a stopping condition is pertinent.
The current stopping conditions would stop both queries Q1 and Q3 at the
right point when the correct plateau is reached. When considering the number
of results found (see also Figure 3), query Q2, however, is stopped somewhat
early in one of the local plateaus.</p>
        <p>Planner Convergence
180
165
150
135
1lts20
1su05
e
R90
ew75
N
#60
45
30
15
0
1
10</p>
        <p>100
# Total Results</p>
        <p>Limitations, Optimizations and Future Work
The Avalanche system has shown how a completely heterogeneous distributed
query engine that makes no assumptions about data distribution could be
implemented. The current approach does have a number of limitations. In particular,
we need to better understand the employed objective functions for the planner,
investigate if the requirements put on participating triple-stores are reasonable,
explore if Avalanche can be changed to a stateless model, and empirically
evaluate if the approach truly scales to large number of hosts. Here we discuss
each of these issues in turn.</p>
        <p>The core optimization of the Avalanche system lies in its cost and utility
function. The basic utility function only considers possible joins with no
information regarding the probability of the respective join. The proposed utility
extension U E estimates the join probability of two highly selective molecules.
Although this improves the accuracy of the objective function, its limitation to
highly selective molecules is often impractical, as many queries (such as our
example query) combine highly selective molecules with non-selective ones. Hence,
we need to find a probabilistic distributed join cardinality estimation for low
selectivity molecules. One approach might be the usage of bloom-filter caches to
store precomputed, “popular” estimates. Another might be investigating
sampling techniques for distributed join estimation.</p>
        <p>In order to support Avalanche existing triple-stores should be able to:
– report statistics: cardinalities, bloom filters, other future extensions
– support the execution of distributed joins (common in distributed databases),
which could be delegated to an intermediary but would be inefficient
– share the same same key space (can be URIs but would result in
bandwidthintensive joins and merges)
Whilst these requirements seem simple we need to investigate how complex these
extensions of triple-stores are in practice. Even better would be an extension
of the SPARQL standard with the above-mentioned operations, which we will
attempt to propose.</p>
        <p>The current Avalanche process assumes that hosts keep partial results
throughout plan execution to reduce the cost of local database operations and
that result-views are kept for the duration of a query. This limits the number
of queries a host can handle. We intend to investigate if a stateless approach is
feasible. Note that the simple approach—the use of REST-ful services—may not
be applicable as the size of the state (i.e., the partial results) may be huge and
overburden the available bandwidth.</p>
        <p>We designed Avalanche with the need for high scalability in mind. The
core idea follows the principle of decentralization. It also supports asynchrony
using asynchronous HTTP requests to avoid blocking, autonomy by delegating
the coordination and execution of the distributed join/update/merge operations
to the hosts, concurrency through the pipeline shown in Figure 1, symmetry by
allowing each endpoint to act as the initiating Avalanche node for a query
caller, and fault tolerance through a number of time-outs and stopping
conditions. Nonetheless, an empirical evaluation of Avalanche with a large number
of hosts is still missing—a non-trivial shortcoming (due to the lack of suitable,
partitioned datasets and the significant experimental complexity) we intend to
address in the near future.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper we presented Avalanche , a novel approach for querying the Web
of Data that (1) makes no assumptions about data distribution, availability,
or partitioning, (2) provides up-to-date results, and (3) is flexible since it
assumes nothing about the structure of participating triple stores. Specifically, we
showed that Avalanche is able to execute non-trivial queries over distributed
data-sources with an ex-ante unknown data-distribution. We showed two
possible utility functions to guide the planning and execution over the distributed
data-sources—the basic simple model and an extended model exploiting
joinestimation. We found that whilst the simple model found some results faster
it did find less results than the extended model using the same stopping
criteria. We believe that if we were to query huge information spaces the overhead
of badly selected plans will be subdued by the better but slower plans of the
extended utility function.</p>
      <p>To our knowledge, Avalanche is the first Semantic Web query system that
makes no assumptions about the data distribution whatsoever. Whilst it is only
a first implementation with a number of drawbacks it represents a first important
towards bringing the spirit of the web back to triple-stores—a major condition
to fulfill the vision of a truly global and open Semantic Web.</p>
      <p>Acknowledgements This work was partially supported by the Swiss National
Science Foundation under contract number 200021-118000. We would like to
acknowledge Cathrin Weiss and Rob H. Warren for their help and contribution
in the development and evolution of the ideas behind Avalanche.</p>
    </sec>
    <sec id="sec-5">
      <title>Appendix</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data - The story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. B. I. M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          .
          <article-title>Network applications of bloom filters: A survey</article-title>
          .
          <source>In Internet Mathematics</source>
          , pages
          <fpage>636</fpage>
          -
          <lpage>646</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Frank</surname>
          </string-name>
          .
          <article-title>Rdfpeers: a scalable distributed RDF repository based on a structured peer-to-peer network</article-title>
          .
          <source>In 13th International World Wide Web Conference (WWW)</source>
          , pages
          <fpage>650</fpage>
          -
          <lpage>657</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>O.</given-names>
            <surname>Erling</surname>
          </string-name>
          . Virtuoso. In http://openlinksw.com/virtuoso/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnstedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , K.-U. Sattler, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          .
          <article-title>Data summaries for on-demand queries over linked data</article-title>
          .
          <source>In 19th International World Wide Web Conference (WWW)</source>
          ,
          <year>May 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , and
          <string-name>
            <surname>S. Decker.</surname>
          </string-name>
          <article-title>Yars2: a federated repository for querying graph structured data from the web</article-title>
          .
          <source>In 6th International Semantic Web Conference (ISWC)</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>224</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Freytag</surname>
          </string-name>
          .
          <article-title>Executing SPARQL queries over the Web of linked data</article-title>
          .
          <source>In 8th International Semantic Web Conference (ISWC)</source>
          , page 293309,
          <year>October 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Heese</surname>
          </string-name>
          .
          <article-title>The SPARQL query graph model for query optimization</article-title>
          .
          <source>In 4th European Semantic Web Conference</source>
          ,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          .
          <article-title>The state of the art in distributed query processing</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>32</volume>
          (
          <issue>4</issue>
          ):
          <fpage>422</fpage>
          -
          <lpage>469</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Wo</surname>
          </string-name>
          <article-title>¨ß. RDFStats - An extensible RDF statistics generator and library</article-title>
          .
          <source>In 8th International Workshop on Web Semantics</source>
          ,
          <string-name>
            <surname>DEXA</surname>
          </string-name>
          ,
          <year>September 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          , W. W¨oß, and
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Bl¨ochl. A Semantic Web middleware for virtual data integration on the web</article-title>
          .
          <source>In 5th European Semantic Web Conference</source>
          ,
          <year>June 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          .
          <article-title>Programming Semantic Web applications: a synthesis of knowledge representation and semi-structured data Doctoral</article-title>
          .
          <source>Doctoral dissertation</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Maduko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Anyanwu</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          .
          <article-title>Estimating the cardinality of RDF graph patterns</article-title>
          .
          <source>In 16th International World Wide Web Conference (WWW)</source>
          ,
          <year>May 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Scalable join processing on very large RDF graphs</article-title>
          .
          <source>In 36th International Conference on Management of Data (SIGMOD)</source>
          ,
          <year>June 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>B.</given-names>
            <surname>Quilitz</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          .
          <article-title>Querying distributed RDF data sources with SPARQL</article-title>
          .
          <source>The Semantic Web: Research and Applications</source>
          , pages
          <fpage>524</fpage>
          -
          <lpage>538</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Papapetrou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Siberski</surname>
          </string-name>
          .
          <article-title>Optimizing distributed joins with bloom filters</article-title>
          .
          <source>In ICDCIT '08: Proceedings of the 5th International Conference on Distributed Computing and Internet Technology</source>
          , pages
          <fpage>145</fpage>
          -
          <lpage>156</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S.</given-names>
            <surname>Schenck</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          .
          <article-title>Networked graphs:a declarative mechanism for SPARQL rules, SPARQL views and RDF data integration on the web</article-title>
          .
          <source>In 17th International World Wide Web Conference (WWW)</source>
          ,
          <year>April 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Federated databases systems for managing distributed, heterogeneous and autonomous databases</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>183</fpage>
          -
          <lpage>236</lpage>
          ,
          <year>September 1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>M. Stocker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Kiefer</surname>
            , and
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Reynolds</surname>
          </string-name>
          .
          <article-title>SPARQL basic graph pattern optimization using selectivity estimation</article-title>
          .
          <source>In 17th International World Wide Web Conference (WWW)</source>
          ,
          <year>April 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. J. Zem´anek, S. Schenk, and
          <string-name>
            <given-names>V.</given-names>
            <surname>Svatek</surname>
          </string-name>
          .
          <article-title>Optimizing SPARQL queries over disparate RDF data sources through distributed semi-joins</article-title>
          .
          <source>In 7th International Semantic Web Conference (ISWC)</source>
          ,
          <year>October 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>Query 1: SELECT ?title ?author ?date WHERE {</article-title>
          ?paperDBLP &lt;http://www.aktors.org/ontology/portal#has-title&gt; ?title . ?paperDBLP &lt;http://www.aktors.org/ontology/portal#has-author&gt; ?author . ?paperDBLP &lt;http://www.aktors.org/ontology/portal#has-date&gt; ?date . ?author &lt;http://www.aktors.org/ontology/portal#full-name&gt;
          <article-title>"Abraham Bernstein"</article-title>
          . } Query 2: SELECT ?name ?title WHERE { ?paper &lt;http://www.aktors.org/ontology/portal#has-author&gt; ?author . ?author &lt;http://www.aktors.org/ontology/portal#full-name&gt; ?name . ?paper &lt;http://www.aktors.org/ontology/portal#has-author&gt; ?avi . ?paper &lt;http://www.aktors.org/ontology/portal#has-title&gt; ?title . ?avi &lt;http://www.aktors.org/ontology/portal#full-name&gt;
          <article-title>"Abraham Bernstein"</article-title>
          . } Query 3: SELECT ?title ?date WHERE { ?author &lt;http://www.aktors.org/ontology/portal#full-name&gt;
          <article-title>"Abraham Bernstein"</article-title>
          . ?paper &lt;http://www.aktors.org/ontology/portal#has-author&gt; ?author . ?paper &lt;http://www.aktors.org/ontology/portal#has-title&gt; ?title . ?paper &lt;http://www.aktors.org/ontology/portal#has-date&gt; ?date . ?paper &lt;http://www.aktors.org/ontology/portal#article-of-journal&gt; ?journal . ?journal &lt;http://www.aktors.org/ontology/portal#has
          <article-title>-title&gt; "ISWC/ASWC"</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>