<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>No synopses</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Scalable Query Dissemination in XPeer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni Conforti</string-name>
          <email>confor@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Ghelli</string-name>
          <email>ghelli@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Manghi</string-name>
          <email>paolo.manghi@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Sartiani</string-name>
          <email>sartiani@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica - Universiat` di Pisa -</institution>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fachbereich Informatik Universiat ̈t Dortmund</institution>
        </aff>
      </contrib-group>
      <volume>9</volume>
      <fpage>6</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents XPeer, a data sharing system for massively distributed XML data. XPeer allows users to publish and query heterogeneous information without any significant administration eoffrts. XPeer tries to dispatch any given query to all and only the potentially relevant peers, exploiting a superpeer network to this aim. In today's world, the Internet affirmed as a powerful communication medium, allowing people from distant places to share and exchange information, as well as to interact. The Internet offers users many ways to communicate, such as forums, blogs, email, instant messages, voip and conference applications. In a similar way, complex global-scale applications can be built by composing services dispatched by single sites (e.g., web services), so to provide new and sophisticated tools to large and geographically distributed organizations. While much emphasis is posed on the role of the Internet as a communication medium, this vision is someway limiting. Indeed, the Internet can also be seen as a formidable, massively distributed data repository, containing user-supplied information about near all knowledge fields. This repository is characterized by some ground properties, mostly induced by the behavior of data providers (typically, net-users) and by the characteristics of data being provided. These properties can be described as heterogeneity, autonomy, and no administration.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Heterogeneity Data published on the Internet and, more generally, information
exchanged on the Net are by nature heterogeneous. Not only they span over
completely unrelated domains (e.g., espresso machines reviews and discussions
about LaTeX), but also data referring to the same domain are represented in
quite different ways.</p>
      <p>With only a few notable exceptions, heterogeneity is definitely not
avoidable, and it is common to data published by single net-users as well as to data
exported by applications. Even though some application efilds have very well
defined standard data representation formats (e.g., the logs of SMTP servers can be
assumed to be homogeneous), the same does not happen for some very popular
applications. Consider, for instance, multimedia players like iTunes, WinAmp,
and MediaPlayer: although they manage essentially the same kind of
information (usually mp3 or AAC music files), they organize their internal metadata
database in very different and usually incompatible ways. 1</p>
      <p>When building an Internet-scale application, managing heterogeneity is, to
some extent, an inevitable (and disturbing) issue. In particular, heterogeneity
plays a signicfiant role in any large scale data sharing system, where users are
allowed to share information without a superimposed global schema. For instance,
in the above example of multimedia players, one can think of a system
allowing users to transparently share descriptions, comments, and ratings about the
multimedia lfies they legally own: as each kind of player represents its internal
database according to a different schema (differences spread from efild names to
data nesting), some efforts for managing heterogeneity are necessary for making
data available to all users.</p>
      <p>Autonomy Strictly related to heterogeneity, a second ground property of the
Internet as massively distributed repository is data providers’ and data sources
autonomy. As data providers are (almost) free to publish whatever kind of
information they want, they are also free to add new contents, as well as to modify
and even drop existing contents they already published. For instance, a blogger
has usually the full control on the information she is publishing, hence she can
add new posts, modify existing posts, or delete old ones whenever she wants.</p>
      <p>Of course, autonomy is someway limited by replication and gossiping
phenomena that are intrinsic to the nature of the Internet.</p>
      <p>No administration A key factor in the success of the Internet as global
communication medium and global-scale repository is that publishing new information
(or using even sophisticated Internet services) requires no or little
administration efforts by a net-user. For instance, setting up a new blog can be done in
a few minutes by a non-expert user and requires only a few and non-technical
information about the new blog. Furthermore, popular file sharing systems
currently being used for sharing large amounts of video and/or music files do not
even require significant administration efforts, as they are able to self-manage
their distributed architecture and their congfiuration.</p>
      <sec id="sec-1-1">
        <title>Our Contribution</title>
        <p>
          We emphasized the role of the Internet as a massively distributed, global-scale
data repository. Till now, database technology has not been able to replicate the
success of the Internet in building large or global-scale databases. As pointed
out in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the reasons of this failure are mostly related to common features of
current database systems, such as ACID transactions, that are not adequate to
a global-scale environment (and they are sometimes even an obstacle).
1 We are still wondering why simple information can be modeled in so many dieffrent
ways.
        </p>
        <p>
          The main objective of this paper is to present XPeer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a data sharing
system for massively distributed XML data. XPeer allows users to publish and
query heterogeneous information without any signicfiant administration efforts.
To this end, XPeer is based on a p2p architecture that is able to self-organize and
self-manage its own administrative layers without the intervention of a database
or system administrator.
        </p>
        <p>
          Unlike similar projects like PIER [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], XPeer recognizes that heterogeneity is
unavoidable and assists the user in surviving heterogeneity, i.e., XPeer assumes
that data are potentially heterogeneous and tries to dispatch any given query
to any potentially relevant peer, while retaining a good degree of selectivity
in query dissemination. The technique used is based on automatic extraction
of schematic information from peers and use of a query-to-schema matching
technique to identify the potentially relevant peers.
        </p>
        <p>The contribution of XPeer is twofold. First, XPeer offers very selective query
dissemination solutions, that allow the system to deliver a query only to a small
superset of the peers containing relevant data, hence reducing both
communication and execution costs. Second, its architectural design is scalable and
fault-tolerant, and can be easily adapted to more sophisticated data integration
techniques based on schema mappings and query reformulation.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Paper Outline</title>
        <p>The paper is structured as follows. Section 2 outlines the design choices on
which XPeer architecture is based. Section 3 describes the architecture of the
system. Section 4 illustrates the solutions used in XPeer for query dissemination,
while Section 5 describes the query execution strategy. Section 6 presents some
experimental results validating the XPeer approach. Sections 7 and 8, finally,
discuss related work and future research.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Architectural Choices</title>
      <p>XPeer main objective is the management of global-scale XML data repositories,
allowing users to share and query heterogeneous data without the need for any
administration activity.</p>
      <p>These goals require the adoption of architectural solutions enhancing both
the scalability and the reliability/availability properties of the system. In
particular, the architectural design of XPeer was inspired by the goal of distribution
scalability, i.e., the system performance should not deteriorate when the
number of nodes in the system increases. As the number of peers increases, both
the degree of data distribution and the magnitude of the query workload
increases, hence the system must cope with an increasing data tracking and query
processing load.</p>
      <p>In order to design a scalable data management system for heterogeneous
XML data, we made some architectural choices that depart from the tradition
of centralized and distributed databases, but can also be regarded as heretic wrt
usual p2p design criteria.</p>
      <p>Query Compilation vs Query Execution A selective and efficient query
dissemination is critical for the scalability of the XPeer system. To this end, query
processing is split in two distinct phases of query compilation and query
execution. During query compilation, a superpeer network cooperatively tries to
identify the peers containing relevant data, so to provide the issuing peer with
a quite precise query plan. During query execution, then, the issuing peer
deploys the query at the peers identiefid in the previous step, and coordinates the
execution of the distributed query.</p>
      <p>
        Separation between query compilation and query execution is unusual in
p2p systems. Much more frequently, queries are disseminated or routed while
being executed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], hence dissemination choices are taken at query run-time.
The reason of this departure from the p2p tradition is twofold. First, we see
in this separation a way for improving query optimization: once received the
compiled access plan from the superpeer network, the issuing peer can apply
distributed query rewriting rules [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] so to decrease query execution costs; this in
turn leads to a significant enhancement of system scalability.
      </p>
      <p>Second (and most important), effectiveness and selectivity of query
dissemination can be significantly improved by exploiting a wider knowledge of the
system properties, managed by a specialized network, while the dissemination of
queries on the basis of local information may lead to decisions that are only
locally promising. We validate this claim in Section 6, where several experimental
results show that XPeer performs query dissemination in a quite selective way.
Non-DHT Superpeer Network Supporting an effective and efficient query
compilation phase requires the adoption of an adequate architecture for the superpeer
network. In our vision, the superpeer network should track the peers connected
to the system and store some form of summary of the content of each peer
(describing both the structure and the data distribution); indeed, any peer joining
the system submits such a summary of its data to the superpeer network, and
should refresh it every time a local update of its data leads to a change in its
schema or in the data distribution. By aggregating and manipulating these
summaries, the superpeer network is able to identify those peers having potentially
relevant data, even if their schemas are not homogeneous.</p>
      <p>A key point of this vision is that complex schema manipulations are necessary
for surviving heterogeneity and identifying query-relevant peers. To lower both
communication and processing costs during schema manipulation, schemas and
synopses should be stored in a very localized way, i.e., without any
fragmentation. Hence, we decided to drop architectures based on Distributed Hash Tables
(DHTs) in favor of a dynamically hierarchical organization of the superpeer
network, allowing the superpeer network to store schemas without
fragmentation. We understand that this choice is somewhat unusual and, to some extent,
heretic. Despite the very good scalability properties of DHTs, we believe that
DHTs are not adequate to our aims. Indeed, DHTs fragment data in a very nfie
way, so to maximize load balancing and parallelism; unfortunately, this aspect
significantly increases communication and processing costs of any activity with
limited parallelism, including schema manipulation.</p>
      <p>p1
The starting point of our architecture is a set of autonomous peers. To make
this nebula act as a distributed repository and query system, XPeer builds an
hierarchical tree-shaped superpeer network over it, as shown in Figure 1.</p>
      <p>On the leaves of the superpeer tree, each node (called superpeer) can spot a
distinct fragment of the nebula, i.e., it keeps track of the schemas and synopses
of peers in that fragment, so that the whole nebula is covered by leaf superpeers.
The summaries managed by these nodes are then tracked by superpeer nodes
in the upper level; in this way, nebula spots are aggregated to allow superpeers
to have a wider view of the peer network. This aggregation introduces some
form of approximation in the combined summaries, to balance the larger set
of information to be managed. This aggregation process is repeated recursively
until the root of the tree hierarchy is reached. The superpeer network root, hence,
has a high level vision of the whole peer network.</p>
      <p>
        An important issue in this hierarchical process is given by summary (schemas
and synopses) representation and aggregation. The current implementation of
XPeer uses a DataGuide-like structure (called treeguide) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for representing
peer schemas, which are automatically inferred by the system. Data distribution
inside XML simple elements is captured by means of equi-depth histograms and
Bloom filters: histograms are used for numeric (integer/floats) element only,
while Bloom filters come into play for string-valued elements. Histograms, Bloom
filters, and treeguides are integrated by endowing each treeguide leaf with the
corresponding histogram or Bloom lfiter, if any.
      </p>
      <p>Based on these representations, summary aggregation is quite
straightforward: treeguides are merged, while histograms are combined to lower the space
requirements. In particular, buckets are reshaped to preserve equi-depth. As
usual, Bloom lfiters are combined through disjunction.</p>
      <p>
        Hierarchical structures in p2p systems, with only a few exceptions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], are
known to be prone to scalability, reliability, and adaptivity issues. To overcome
these problems and make XPeer scale with the size of the peer network, XPeer
adopts a technique called cloning, which is essentially farm-like replication, with
weak synchronization requirements.
      </p>
      <p>The basic idea of cloning is that a superpeer is not a physical entity, but,
instead, a virtual entity consisting of several nodes cooperating in managing
queries and updates. Each of these nodes is called a clone, as it replicates the
information managed by the virtual superpeer where it resides. It is important
to notice that a clone is not a dedicated machine or supercomputer; instead,
any peer in the system may become a clone, once it has expressed its willing to
perform administrative tasks. A superpeer, hence, is a virtual entity consisting
of heterogeneous and geographically distributed clones.</p>
      <p>From outside a superpeer, clones are invisible and each of them is able to
handle both query compilation requests and summary update requests. Each
request to a superpeer is routed to a randomly chosen clone by an enhanced
communication and transport layer, which is clone-aware, and which is an
essential part of the communication infrastructure on top of which XPeer is built.</p>
      <p>A relevant issue for the proper work of clones is summary synchronization. As
superpeer information is fully replicated among clones, and, as each clone can
independently handle summary update requests, a synchronization algorithm
among clones, ensuring (some sort of) consistency and scalability, is necessary.
We chose to favor scalability on consistency by adopting a synchronization
algorithm based on a best-effort linear synchronization scheme, where, every time a
clone receives an update request, it forwards the request to all its sibling clones.
Retransmission mechanisms based on version numbers and hash signatures are
employed for dealing with synchronization message failure.</p>
      <p>We can now see how cloning helps both reliability and scalability. For what
concerns reliability, a superpeer failure requires that all clones fail, which is quite
infrequent.</p>
      <p>For what concerns scalability, the number of clones in a superpeer is
continuously modified to match the load of incoming and outgoing messages. In
particular, each clone in a superpeer periodically monitors the length of its
message queues: when they exceed a given threshold the clone tries to recruit a new
peer to join the superpeer.2 This process is performed by the system with no
human intervention.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Query Compilation</title>
      <p>Query compilation in XPeer aims at transforming a query in an algebraic
expression where each leaf is a location operator, identifying a data source that is
found in a specific peer. The core step of this process is distributed compilation,
2 Clone failures are managed essentially in the same way, as they impact the messaging
load of the remaining clones.
where the superpeer network tries to identify interesting peers on the basis of
their summaries, so that only such peers appear in the compiled query.
4.1</p>
      <sec id="sec-3-1">
        <title>Query Language and Query Algebra</title>
        <p>
          XPeer supports a significant fragment of XQuery [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], roughly equivalent to the
FLWR core of the language. The XPeer query algebra [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] contains operators for
evaluating path and twigs, for filtering variable bindings according to predicates,
for building new XML fragments, and for incorporating peer information inside
query plans. This algebra is an extension of that described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and is close to
other algebras for semistructured and XML data [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
4.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Local Compilation</title>
        <p>Local compilation translates a query into an incomplete algebraic expression,
i.e., an algebraic expression without location operators. The query is rfist
translated, by the issuing peer, into an intermediate representation based on query
blocks; on this representation standard rewriting techniques are applied, like, for
instance, the factorization of common subexpressions. Finally, the intermediate
representation is used for generating the algebraic expression corresponding to
the query.
4.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Distributed Compilation</title>
        <p>Once a peer px issuing a query q has locally compiled the query into an incomplete
algebraic expression, it submits the algebraic representation of q to the superpeer
network. The superpeer network, hence, lfils the holes in the algebraic expression
by identifying a set of peers containing potentially relevant data; this process is
guided by the summary information stored in the superpeer network.</p>
        <p>For an example, we will refer to the query shown below, assuming that the
query is satisfied by peers p1, p13, and p17, and that the superpeer network has
the structure shown in Figure 2. This query retrieves the titles of all
undergraduate courses from a university courses database.
for $c in $db//course,</p>
        <p>$l in $c/level
where $l = "U"
return $c/title</p>
        <p>The distributed compilation step is then organized in two phases: the
ascending phase, and the descending phase.</p>
        <p>Ascending phase px submits the query q, by sending its compiled version to its
superpeer (sp1, in this case). Once received the query, this superpeer i) propagates
the query to its father in the hierarchy, and ii) matches the query against schemas
and synopses locally hosted, as shown in Figure 2(a). Any positive match (p1)
is then directly communicated to px.</p>
        <p>As the father of sp1 (sp19) receives the query from sp1, it recursively forwards
the query to its father (the root of the hierarchy, in this case) and matches
the query against its schemas and synopses. Unlike the match performed by
leaf superpeers, which can directly identify interesting peers, the match at an
intermediate level serves the purpose of finding hierarchy subtrees that may
contain information about peers with relevant data. In the case of our example,
sp19 finds that sp27 children may comprise interesting peers, hence sp19 forwards
the query to sp27 (see Figure 2(b)). sp27 will perform exactly the same actions as
sp1, with the only difference that it will not resend the query to its own father.
Descending phase When the root of the tree hierarchy receives a query from one
of its children, the ascending phase for this query ends, and the descending phase
starts. The purpose of this phase is to explore the fragment of the hierarchy that
has not been touched by the ascending phase. This exploration is performed on
a summary-driven basis, hence only the subtrees that track potentially relevant
peers are actually explored. In the case of our example, the root nfids that the
subtree rooted by sp90 is worth a further exploration, hence it propagates the
query to sp90 (Figure 2(c)). sp90, in turn, discovers that sp99 may have
information on interesting peers, while sp94 does not match the query; hence, sp90
sends the query to sp99 (Figure 2(d)). sp99, nfially, identifies p17 as a potential
data supplier, and sends this information directly to px.</p>
        <p>Two important things must be noted about distributed compilation. First,
each query must reach the root of the hierarchy, as the root only has some
knowledge about the whole peer network. This, in turn, means that the superpeer
load increases while moving from the bottom to the top of the hierarchy. This
is not a problem, since each superpeer just recruits as many clones as necessary
to perform its task.</p>
        <p>Second, the superpeer network sends the information about any interesting
peer as soon as it is found. This allows for an improvement in the compilation
response time.
4.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>Query-to-schema Match</title>
        <p>A key aspect of the query compilation approach of XPeer is represented by
the algorithm being used for matching a query against a schema and a set of
synopses. This algorithm guides the compilation process and allows the system
to identify peers that may satisfy the query as well as to avoid the traversal of
fragments of the superpeer network that do not contribute to the compilation
result.</p>
        <p>The query-to-schema match algorithm works in two main steps. During the
first step, the algorithm extracts the twigs (i.e., the “tree patterns”) from a given
query (after clause normalization), and “evaluates” the twigs over the schema:
if the result of this evaluation is not empty, there may exist some instance of the
schema that contains data satisfying the structural requirements of the query. To
improve the selectivity of the matching process, the algorithm then compares the
sp1</p>
        <p>root
sp34
root
sp34
root
sp34
root
sp34
sp90
sp90
sp99
sp99
sp90
sp90
sp99</p>
        <p>sp1
sp99
sp1
sp19
sp3
sp19
sp3
sp27
...
sp27
...
p13
...</p>
        <p>p17
px ... p1
p13
...</p>
        <p>p17
(a) Ascending phase: leaf
superpeers
(b) Ascending phase: intermediate
superpeers
p13
...</p>
        <p>p17
px ... p1
p13
...</p>
        <p>p17
(c) Descending phase: root
(d) Descending phase:
intermediate superpeers
Fig. 2. Distributed compilation.
institution</p>
        <p>course
root
level</p>
        <p>title
predicates specified in the where clauses of the query with the statistical synopses
associated with the schema, so to discard those peers that give no contribution
to the query result. The following Example illustrates the algorithm and the use
of synopses.</p>
        <p>Example 1. Consider the query of the previous Section, and assume that p1 data
are described by the schema shown in Figure 3.</p>
        <p>Suppose that sp1 checks whether p1 may contain relevant data. To this
purpose, sp1 interprets p1 treeguide as a data tree, and executes the binding
fragment of the query on it. In our case, sp1 evaluates for $c in $db//course,
$l in $c/level on the treeguide, and returns a non-empty tuple where $c and
$l are bound to the matching schema nodes.</p>
        <p>This tuple tells sp1 that p1 data satisfy the structural requirements of the
query; however, p1 data may not comprise level elements whose content is “U”.
As a consequence, a supplementary check is needed to enforce the satisfaction of
the where clause predicate. To this end, sp1 accesses the Bloom filter associated
to the level node of the schema, and just verifies if the set of strings described
by the filter includes “U”.
5</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Query Execution</title>
      <p>
        Query execution strategy of XPeer is quite simple, as it involves no other nodes
but the query issuer and the data sources, and it exploits the simplest
communication pattern. The input of this phase is an algebraic expression, where every
data source has been identified during the distributed compilation phase. The
peer issuing the query decomposes this algebraic expression into subexpressions,
called pipes, to be delivered to and executed by remote peers, and coordinates
remote peer execution. As usual, XPeer tries to minimize network traffic by
pushing selections and twig evaluation down the tree, exploiting canonical algebraic
rewriting rules [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The decomposed query is then formed by several pipes. Each given pipe
contains an algebraic expression describing the query fragment that a given
remote peer must execute. Any operation involving data coming from multiple
peers is executed by the issuing peer: the corresponding pipe, which essentially
coordinates the execution of the whole query, is called the host pipe.</p>
      <p>After query decomposition, pipes are sent to remote peers and executed.
When a remote peer receives a pipe P, it compiles the algebraic expression
inside P into a physical plan and waits for a start-execution message.</p>
      <p>Execution inside a given pipe follows the iterative model, where new results
are requested by means of next messages. Interaction among pipes, instead,
follows an asynchronous buffered iterative model. To avoid deadlocks and to
be resilient to node and network failures, interactions are asynchronous and
subject to time-outs: indeed, the operator used for managing communications
with children pipes sends asynchronous next requests to the children pipes; if no
response is given after a certain period of time, it assumes that the corresponding
pipe is blocked or that no more data are available, hence it stops contacting the
remote peer.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>A crucial feature of XPeer is its ability to dispatch a given query to a relatively
small superset of the peers with relevant data, avoiding dissemination policies
based on broadcast or flooding. In this Section we will present experimental
results validating this claim; these results show that, by relying on the
queryto-schema match algorithm, the superpeer network of XPeer is able to discard
a signicfiant fraction of peers containing irrelevant data.
6.1</p>
      <sec id="sec-5-1">
        <title>Experimental Setup</title>
        <p>Experiments were performed on a 100-node peer network, running on a
cluster of Linux machines. We simulate an application where a set of
universities publish information on their courses, and the data of each university is
structured according to one among three different schemas. To this aim, we
started from the three lfies in the University Courses XML dataset, available at
http://www.cs.washington.edu/ research/xmldatasets/; these XML
documents have been randomly fragmented and each fragment has been assigned to
a peer. This setup exempliefis a typical situation where a number of sites publish
information with a limited degree of dishomogeneity.</p>
        <p>The behavior of the system was controlled and observed through XOrch, a
global orchestration tool that we developed inside the XPeer project. XOrch can
control the behavior of peers and superpeers by means of scores, i.e., scripts
represented in a CSP-based language and describing the actions a single peer or
superpeer must execute; scores are sent by the Orc (i.e., the central monitoring
component) to local proxies, which execute them by interacting with the attached
peers.
The test query workload is formed by 10 XQuery queries, divided into four classes
on the basis of a qualitative selectivity estimation. The first class contains queries
with rather selective twigs, while the second class focuses on selective predicates;
the third class comprises queries where both twigs and predicates are deemed
as selective; the fourth class, finally, contains negative queries only, i.e. queries
with no answer.</p>
        <p>We assembled our test workload with the aim of mimicking “real life”
workloads, hence the test queries contain both / and // operations, union paths, and
multiple predicates connected by and/or logical connectors.</p>
        <p>More information about the test workload as well as supplementary test
results can be found at http://datatop2.di.unipi.it/experiments.
6.3</p>
      </sec>
      <sec id="sec-5-2">
        <title>Experiments</title>
        <p>In our experiments, we measured precision and recall of compilation for positive
queries, as well as the absolute compilation error for negative queries, i.e., the
difference between the number of peers deemed as interesting by the compilation
process and the number of peers actually hosting relevant data; we performed
these evaluations for four different compilation configurations. In the rfist
configuration, we did not use synopses (Bloom lfiters and histograms) at all, hence
limiting the query-to-schema match to a purely structural one; in the remaining
congfiurations we used synopses of increasing dimensions: 96-bit, 480-bit, and
4800-bit Bloom lfiters, and histograms containing up to 15, 30, and 70 values,
respectively.</p>
        <p>Results of our experiments are shown in Figures 4 and 5. Since both schemas
and synopses are upper approximations of the actual peer data, the recall should
always be 100% when data is neither updated nor lost; the experiments conrfim
this, hence we only plot precision. As it can be noted, compilation is quite precise
even in the absence of synopses; only query Q6 showed poor results, probably
related to some issues concerning partial match string predicate.</p>
        <p>Synopses really come into play on negative queries; while the no-synopses
congfiuration generates significant compilation errors, the use of synopses allows
the system to correctly recognize negative queries.
7</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related Works</title>
      <p>
        Several projects focus on the problem of evaluating structured and complex
queries in p2p systems. Among these projects, Pier [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is definitely the closest
system to XPeer. Pier is based on the use of a DHT, where relational,
homogeneous data are cached. Pier queries are executed by adapting techniques from
parallel databases to a DHT-based storage, hence exploiting the inner parallelism
of DHTs. As Pier is focused on homogeneous data and parallelizable queries, it
should be regarded as complementary wrt XPeer, which, instead, focuses on
heterogeneous, hierarchical data and mostly hierarchical query operators. The only
n
o
i
s
i
c
e
r
P
0.8
0.6
0.4
0.2
0
Q1
      </p>
      <p>Q2</p>
      <p>Q3</p>
      <p>Q4</p>
      <p>Q5</p>
      <p>Q6</p>
      <p>Q7</p>
      <p>Queries
significant limitation of Pier wrt XPeer is query dissemination, which, in most
cases, requires a broadcast on the whole peer network.</p>
      <p>
        Heterogeneity management is, instead, the focus of the Piazza project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Piazza is essentially a p2p-structured data integration system for XML data,
where the data integration task is dispersed across the whole network. Peers
(which may represent data sources) are connected through schema mappings,
and query processing is based on a flooding algorithm that broadcasts queries
among peers. Piazza, hence, suffers from severe scalability and query
dissemination problems, that greatly limit its applicability.
      </p>
      <p>
        A different way of supporting heterogeneity is adopted in the system
described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where a coordinator-free architecture for distributed XML query
processing is presented. By assuming that peers contain semantically related
data, the system uses a multi-hierarchic organization of the domain space to
route queries (in the form of mutant query plans, i.e., query plans containing
materialized data too). This approach does not seem adequate when data are
semantically heterogeneous, and it makes query dissemination quite expensive.
      </p>
      <p>
        An interesting hybrid between structured and unstructured p2p systems is
represented by KadoP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In KadoP a DHT is used for storing a distributed
full-text index about documents and web services; this index, together with
ontologies, is used during query processing for locating interesting data and
services. Resource location, hence, requires the system to perform several key
lookups in the DHT index, with a significant messaging cost, which depends on
the dimension of the query, as well as on the data and services involved by the
query.
8
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>The wide diffusion of the Internet has greatly increased the number of
datasources publicly available as well as the amount of available data. As a
consequence, a need for query systems able to match the scalability properties of the
Internet emerged. This paper presented XPeer, a query system for massively
distributed and heterogeneous XML data. XPeer allows users to perform structured
queries on globally distributed data, without the hassles of any administration
activity and while preserving full control on their own data. We described the
principles on which XPeer is based, as well as its architectural paradigm,
relying on a tree-shaped superpeer network; we also illustrated how a virtualization
technique called cloning can be used for ensuring both robustness and scalability
of the architecture.</p>
      <p>We showed in detail the techniques used for disseminating queries inside the
network: indeed, we illustrated how a query-to-schema matching technique can
be exploited for making query dissemination smart and precise, as conrfimed by
the experimental results we provided.</p>
      <p>In the near future, we plan to perform extensive experiments about
compilation, so to evaluate the selectivity of query dissemination in the presence of
massive network or node failures. We also plan to enhance the class of supported
predicates at compilation time, as well as to study the behavior of the system
on very large networks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Huebsch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chun</surname>
            ,
            <given-names>B.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellerstein</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loo</surname>
            ,
            <given-names>B.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maniatis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roscoe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yumerefendi</surname>
            ,
            <given-names>A.R.:</given-names>
          </string-name>
          <article-title>The architecture of pier: an internetscale query processor</article-title>
          .
          <source>In: CIDR</source>
          . (
          <year>2005</year>
          )
          <fpage>28</fpage>
          -
          <lpage>43</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghelli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manghi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conforti</surname>
          </string-name>
          , G.:
          <article-title>XPeer: A self-organizing XML P2P database system</article-title>
          .
          <source>In: Proceedings of the First EDBT Workshop on P2P and Databases (P2P&amp;DB</source>
          <year>2004</year>
          ),
          <year>2004</year>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Papadimos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tufte</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Distributed Query Processing and Catalogs for Peer-to-Peer Systems</article-title>
          .
          <source>In: CIDR 2003, First Biennial Conference on Innovative Data Systems Research</source>
          , Asilomar, CA, USA, January 5-
          <issue>8</issue>
          ,
          <year>2003</year>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A query algebra for xml p2p databases</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Foundations of Models</source>
          and
          <article-title>Languages for Data and Objects (FMLDO)</article-title>
          .
          <source>In conjunction with the 10th Int. Conference on Extending Database Technology (EDBT</source>
          <year>2006</year>
          ).
          <article-title>(</article-title>
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goldman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>DataGuides: Enabling query formulation and optimization in semistructured databases</article-title>
          .
          <source>In: VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29</source>
          ,
          <year>1997</year>
          , Athens, Greece, Morgan Kaufmann (
          <year>1997</year>
          )
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ooi</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vu</surname>
            ,
            <given-names>Q.H.</given-names>
          </string-name>
          :
          <article-title>Baton: A balanced tree structure for peerto-peer networks</article-title>
          .
          <source>In: VLDB</source>
          . (
          <year>2005</year>
          )
          <fpage>661</fpage>
          -
          <lpage>672</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Boag</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamberlin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Florescu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <source>Simeo´n, J.: XQuery 1</source>
          .0:
          <string-name>
            <surname>An</surname>
            <given-names>XML</given-names>
          </string-name>
          <string-name>
            <surname>Query</surname>
          </string-name>
          <article-title>Language</article-title>
          .
          <source>Technical report, World Wide Web Consortium</source>
          (
          <year>2005</year>
          )
          <article-title>W3C Candidate Recommendation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albano</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Yet Another Query Algebra For XML Data</article-title>
          . In Nascimento, M.A., O¨zsu, M.T., Zıa¨ ne, O., eds.
          <source>: Proceedings of the 6th International Database Engineering and Applications Symposium (IDEAS</source>
          <year>2002</year>
          ), Edmonton, Canada,
          <source>July 17-19</source>
          ,
          <year>2002</year>
          . (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Paparizos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          :
          <article-title>Pattern tree algebras: Sets or sequences</article-title>
          ? In: VLDB. (
          <year>2005</year>
          )
          <fpage>349</fpage>
          -
          <lpage>360</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mork</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatarinov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Piazza: data management infrastructure for semantic web applications</article-title>
          .
          <source>In: Proceedings of the Twelfth International World Wide Web Conference, WWW2003</source>
          , Budapest, Hungary,
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          May
          <year>2003</year>
          , ACM (
          <year>2003</year>
          )
          <fpage>556</fpage>
          -
          <lpage>567</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Abiteboul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manolescu</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Preda</surname>
          </string-name>
          , N.:
          <article-title>Constructing and querying peer-to-peer warehouses of xml resources</article-title>
          .
          <source>In: ICDE</source>
          . (
          <year>2005</year>
          )
          <fpage>1122</fpage>
          -
          <lpage>1123</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>