<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Avalanche: Putting the Spirit of the Web back into Semantic Web Querying</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cosmin Basca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DDIS, Department of Informatics, University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Traditionally Semantic Web applications either included a web crawler or relied on external services to gain access to the Web of Data. Recent e orts, have enabled applications to query the entire Semantic Web for up-to-date results. Such approaches are based on either centralized indexing of semantically annotated metadata or link traversal and URI dereferencing as in the case of Linked Open Data. They pose a number of limiting assumptions, thus breaking the openness principle of the Web. In this demo we present a novel technique called Avalanche, designed to allow a data surfer to query the Semantic Web transparently. The technique makes no prior assumptions about data distribution. Speci cally, Avalanche can perform \live" queries over the Web of Data. First, it gets on-line statistical information about the data distribution, as well as bandwidth availability. Then, it plans and executes the query in a distributed manner trying to quickly provide rst answers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the rapid growth of the Web of Data, unexplored avenues for application
development are emerging. While some application designs include a
Semantic Web (SW) data crawler, others rely on services that facilitate access to the
Web of Data (WoD) either through the SPARQL protocol or various API's (i.e.
Sindice or Swoogle). As the mass of data continues to grow { currently LOD [
        <xref ref-type="bibr" rid="ref2 ref7">2</xref>
        ]
accounts for 4.7 billion triples { the scalability factor will give raise to a new set
of challenges. Marginally addressed today is the question: How to query the Web
of Data on-demand, without hindering the exible openness principle of the Web
{ seen as the ability to query independent un-cooperative semantic databases,
not controlling their distribution, their availability or having to adhere to xed
publishing guidelines (i.e. LOD). Currently, some approaches maintain a global
index over all SW data, striving to deliver up-to date results. They become
expensive as the quantity of data grows and demand rises. Derived from
traditional distributed database systems, they o er high performance, but break
the exibility and openness principle by assuming perfect knowledge about
participating data. Instance level federation (i.e. subjects are distributed to hosts
according to a known scheme) as in the case of SemWIQ [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] relaxes the
centralization constraints by increasing the exibility of the system while still holding
some constraints over data distribution. As proposed by Hartig et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] other
approaches are based on the principle of link traversal and URI dereferencing.
These algorithms are driven by LOD principles, o ering high exibility at the
cost of potentially expensive query processing (due to network latency) and the
impossibility to issue certain types of queries as pointed out by the authors.
      </p>
      <p>
        This demo proposes to address the issue of preserving the openness and
exibility of WoD while being able to query it \live". Our motivation lies in the
belief that such exibility is paramount for future development of the Semantic
Web { as it was for the WWW. Considering the Web's uncontrollable nature,
Avalanche [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] strives to nd the rst K results as fast as possible.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Avalanche | System</title>
    </sec>
    <sec id="sec-3">
      <title>Design and Implementation</title>
      <p>The system consists of six major components working together in a parallelized
pipeline: the Avalanche endpoints Web Directory or Search Engine, the
Statistics Requester, the Plan Generator, Plan Executor instances, Plan Materializer
instances and the Query Stopper component as seen in Figure 1.</p>
      <p>The algorithm is comprised of two steps: the Query Preprocessing step and
the parallel Query Execution step. During Query Preprocessing, participating
hosts are selected via means of a lightweight endpoint-schema inverted index.
Ontological pre x (the shorthand notation of the schema { i.e. foaf) and schema
invariants (i.e. predicates, classes, labels, etc) are appropriate candidate entries
to index. After query parsing, this information is immediately available and used
to quickly trim down the number of potential endpoints. Further all selected
Avalanche endpoints are queried for unbounded variables instance counts.</p>
      <p>Query Execution follows: rst the query is broken down into the superset of
all molecules, where a molecule is a subgraph of the overall query graph. A
combination of minimally overlapping molecules, covering the query graph, is
referred to as a solution. Binding all molecules in a given solution to physical
hosts that may resolve them, transforms a solution into a plan. It is the Plan
Generator's job to issue plans for execution, as soon as assembled. An emergent
challenge from preserving the openness of the query process and the exibility of
semantic data publishing, is denoted by the exponential complexity class of the
plan composition space. It is thus crucially important to issue \good" plans rst
(plans that aid in nding the answers fast) while pruning and stopping undesired
plans as soon as possible. To this end, the molecule space is trimmed empirically
and plans are generated ordered given their objective function. The objective of a
plan is represented as the ratio between the utility (number of results produced)
and the cost of execution (time spent to execute subqueries, distributed joins
and bandwidth consumption). Consider for example the following query over
RDF data describing publications:
SELECT ?name ?title WHERE {
?paper akt:has-author ?author; akt:has-author ?jim; akt:has-title ?title.
?author akt:full-name ?name. ?jim akt:full-name "James A. Hendler". }
Avlanache's goal is to return the list of all authors that wrote papers with
\James A. Hendler" (bound variable) and their titles, given that the required
data is spread with an unknown distribution over several independent hosts.</p>
      <p>At a given moment during the execution of a plan, a Plan Executor may
nd itself in the following state: molecule M1 (see Figure 2) was reported to be
highly selective on host A, while the remainder of the query (molecule M2) is a
low selectivity molecule on host B. Given that bandwidth and latency cost can
be high, partial results 1 are not sent to one central site to be joined. Instead
we start the execution with the highly selective molecule M1 (host A) and then
lter results on host B by sending over results from host A as in Figure 2.</p>
      <p>Similarly the Plan Materializer, materializes out-of-order nished plans, by
merging partial results and fetching their actual string representations. Since
we have no control over the distribution and availability of RDF data and data
providers (SPARQL endpoints), getting a complete answer to the query is an
unreasonable assumption. Instead the Query Stopper monitors for the following
stopping conditions : a global timeout, a rst K results policy { all execution
stops as soon as K (unique) results are returned to the caller { and nally, to
avoid waiting for the timeout when the number of results is K we measure
relative result-saturation (i.e. we stop at 90% saturation).</p>
      <p>
        The cost of expensive distributed n-way joins can be reduced via
bloomjoins [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We extend the objective function with a qualitative join estimation, as
described in [
        <xref ref-type="bibr" rid="ref3 ref8">3</xref>
        ] and since constructing bloom lters for large sets is expensive, we
only consider the highly selective triple patterns (a threshold set empirically).
1 It is important to note that to execute plans, hosts will need to share a common
id space { a given in Semantic Web via URIs. Naturally, using RDF strings can
be prohibitively expensive. To limit bandwidth requirements we, chose to employ a
single global id space in the form of the SHA family of hash functions on the URIs.
To demonstrate the Avalanche algorithm and evaluate its performance we
conducted a preliminary evaluation on the combined IEEE, ACM and DBLP
LOD datasets, totaling 35 million triples. We distributed the datasets over a
ve-node cluster, splitting by dataset and chronological order (i.e. ACM articles
till 2003 on host A). Each machine had 2GB RAM and an Intel Core 2 Duo
E8500 @ 3.16GHz. Avalanche was able to successfully execute query plans
and retrieve many up-to-date results without having any prior knowledge of the
data distribution. Furthermore, we observed that di erent objective functions
have a signi cant in uence on the outcome and should play a critical role when
deployed on the Semantic Web. The demo2 will mirror this setup and let people
pose arbitrary queries live.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In order to preserve the openness of the Web of Data, one has to make shallow
or no prior assumptions to data distribution and availability. Also given the
uncontrollable nature of the Web, a rst K results policy makes sense. For this
purpose we developed and present Avalanche a novel approach for querying the
Web of Data that: (1) makes no assumptions about data distribution, availability,
or partitioning, (2) provides up-to-date results, and (3) is exible as it makes no
assumption about the structure of participating triple stores. To our knowledge,
Avalanche 3 is the rst Semantic Web query system that makes no assumptions
about the data distribution whatsoever. Whilst, it is a rst prototype with a
number of drawbacks it represents a rst important step towards bringing the
spirit of the Web back to triple-stores | a major condition to ful ll the vision
of a truly global and open Semantic Web.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Basca</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>Avalanche: Putting the Spirit of the Web back into Semantic Web Querying</article-title>
          .
          <source>In Proceedings of the 6th International Workshop on Scalable Semantic Web Knowledge Base Systems</source>
          ,
          <year>November 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data - The story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. B. I. M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          .
          <article-title>Network applications of bloom lters: A survey</article-title>
          .
          <source>In Internet Mathematics</source>
          , pages
          <volume>636</volume>
          {
          <fpage>646</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Freytag</surname>
          </string-name>
          .
          <article-title>Executing SPARQL queries over the Web of linked data</article-title>
          .
          <source>In 8th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          , W. Wo , and
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Blochl. A Semantic Web middleware for virtual data integration on the web</article-title>
          .
          <source>In 5th European Semantic Web Conference</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Papapetrou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Siberski</surname>
          </string-name>
          .
          <article-title>Optimizing distributed joins with bloom lters</article-title>
          .
          <source>In ICDCIT '08: Proceedings of the 5th International Conference on Distributed Computing and Internet Technology</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>2 Video available at: http://www.i .uzh.ch/ddis/ leadmin/basca/avalanche-demo.mov</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>3 This work was partially supported by the Swiss National Science Foundation under contract number 200021-118000</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>