<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query Processing in Self-Organized Storage Systems</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Supervisor: Robert Tolksdorf - PhD Research Phase 1 Department of Computer Science - AG Networked Information Systems Freie Universita ̈t Berlin - Ko ̈nigin-Luise-Str.</institution>
          <addr-line>24/26, 14195 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Storage systems are increasingly approaching their limits regarding system response to node overload and failure as well as overall scalability. Selforganized systems can be a solution to those issues. However, query processing research has not yet evolved to this area. This research proposal aims at extending distributed query processing to self-organized systems. The different components are discussed, and evaluation criteria for a emerging solution are laid out.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        A different approach to distributed storage systems are tuple-based
selforganizing storage systems using swarm algorithms. Starting from an approach
implementing the Linda coordination language, this method makes use of
natureinspired algorithms to efficiently route storage and retrieval requests in an
unstructured network consisting of a potentially large amount of nodes. Systems
built with these algorithms have the potential to scale without inherent limits,
and are thus very interesting for further research. The lookup facility in this
system is currently limited to simple key-value requests [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The RDF data storage
concept with its unified triple model makes tuple storage systems interesting for
a large number of applications, and research is progressing on the subject [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
However, complex query processing has not been addressed at this point, due to
the unstructured and undefined nature of the storage networks involved.
      </p>
      <p>The research problem will therefore be the design, implementation and
evaluation of a efficient query processor for a self-organized storage system. If these
systems would contain such a component, it would enable them to replace
systems using conventional methods while improving the scalability of the storage
system. Significant adaptations of the methods previously used to achieve
distributed query processing are expected to be required in the scope of this work,
but the overall established structure can be maintained. It is however not yet
clear, if there are efficient solutions to this issue at all.</p>
      <p>The remainder of this paper is structured as follows: Section 2 formulates
the research questions and clarifies relevant assumptions, 3 shows the approach
followed in order to identify a solution, which is presented in Section 4. Section
5 shows the steps to be taken in order to evaluate the solution. Finally, Section
6 concludes this paper.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Main questions of the thesis</title>
      <p>
        Given a self-organized storage system with no global data structures as they are
present in other self-organizing tuple storage systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the main challenge
of this work is to determine whether it is possible to efficiently evaluate
complex data queries for those systems. An example for self-organizing systems
without global data structures are swarm-based approaches [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The ongoing
implementation efforts for storage systems using these approaches are currently
only considering simple lookup techniques [
        <xref ref-type="bibr" rid="ref4 ref8">4, 8</xref>
        ], thus the planned work will
yield considerable progress for this area of research.
      </p>
      <p>The combination of a swarm-based self-organizing tuple storage system
with a efficient query processing system could provide a big improvement for
tuple storage systems such as RDF stores. With its conceptually unlimited
scalability, the combination is able to handle the enormous amount of data generated
by Future Internet applications and users while still being able to retrieve any
information using complex query languages. Compared to structured Peer-to-Peer
approaches, swarm-based systems are better able to cope with ever-changing
network structure and load shifts.</p>
      <p>A number of prerequisites are assumed to be fulfilled, in order to focus
entirely on query processing. It is assumed, that there exists a system, which can be
run on a large number of connected computer systems (nodes). This system will
form an overlay network consisting of a neighborhood of known nodes for each
node. The entirety of nodes is able to store and locate tuples of arbitrary length.
Client computer systems can connect to any node using a specialized client API.
Clients can write tuples into the storage network by specifying them explicitly.
Clients can read tuples from the storage network by specifying fixed parts of
tuples (templates). All tuples matching the fixed parts given in the template are
then returned. A single node does not recognize more than its immediate
neighbor nodes. There is no method to retrieve all tuples stored in the network. While
retrieving data for a specific key, nodes can give an approximation, which
neighbor node may store or lead to matching tuples. Issues regarding the handling of
single key lookup operations or data storage are not within the scope of the
planned work.
3</p>
    </sec>
    <sec id="sec-3">
      <title>General approach</title>
      <p>
        In accordance with previous work in the area of distributed query processing, the
structure defined in that area is re-used in order to make the different approaches
comparable. The following components involved in query processing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have
been determined to be affected:
1. Query Optimizer and Plan Refinement Using a cost model considering
either calculated values from static methods, general query heuristics, or
statistical accumulation of knowledge, the query optimizer selects a
execution plan for the query. In a system without global knowledge, it may not be
possible to converge on the optimal plan immediately. Instead, approaches
like mutable query plans may be employed, where the plan is adapted and
refined during actual execution [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ].
2. Catalog The catalog maintains a collection of metadata describing the
distribution and organization of data, their indicies, cardinalities, and further
information all intended to assist the other components in performing their
tasks. For a distributed system, the catalog may also be distributed. For a
distributed relational database, this catalog defines all relations stored in the
system, for an RDF store with its unified data model this is not required. In
our case, this information is not available to the single nodes, hence only
heuristical and statistical methods are applicable to support the other
components.
      </p>
      <p>The listed components have to be adapted fundamentally in order to
function in a self-organized environment. The remaining components are expected to
remain largely untouched. The new query processing system will then be
compared with the current key-only lookup technique using a performance
analysis. A significant statistical improvement over naive approaches is expected for
complex queries.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed solution</title>
      <p>In order to build a complex query processing system on top of a self-organized
storage system, the base functionality of that system, in particular its ability to
retrieve data using a single lookup key, has to be verified first. For any tuple
stored in the system, the success rate for read operations requesting that tuple
has to be shown to be high enough to support reliable operations. In a second
step, existing query processing paradigms will be researched and adapted for our
self-organizing system. Considerable amount of theoretical work will precede
the implementation of one or various possible solutions on top of our existing
self-organizing storage system. The last step will be a comparison between the
new solution and the naive approach as well as existing solutions, where
applicable.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>As we have shown, static methods for query processing may not be applicable
in our case. Hence, in order to evaluate our query processor, the emerging
prototype will be integrated into a self-organized storage system and tested against
a naive implementation. To make results comparable, an traditional distributed
query processing engine will also be tested against our approach while
paying attention to the design goals for self-organized systems. A test data set as
close as possible to a real-world setup will be chosen and stored in the storage
network. A set of complex queries involving this data set will also be chosen
and executed. The average time to complete each query in multiple runs from
different nodes will be measured, and the average results compared.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Future work</title>
      <p>The work described in this paper will continue with an exhaustive screening
of research regarding query processing in distributed environments. Using this
information, the road map to identify a set of possible solutions to the issue of
efficient query evaluation in an unstructured distributed storage systems will be
set up. After first phase of theoretical calculations and simulations on the set of
possible solutions, the most promising solutions will be selected for
implementation into a existing self-organized distributed storage system. Benchmarking
runs and comparison with naive approaches such as flooding and random walk
will show the environment where the selected solutions yield an improvement.
It is hoped, that one of the selected solutions will cover a significant number of
environments. This solution will then be refined and evaluated furthermore.
Acknowledgments This work has been supported by the ”DigiPolis” project
funded by the German Federal Ministry of Education and Research (BMBF),
support code 03WKP07B.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Philip</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bernstein</surname>
          </string-name>
          , Nathan Goodman, Eugene Wong, Christopher L.
          <string-name>
            <surname>Reeve</surname>
          </string-name>
          , and
          <string-name>
            <surname>James</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rothnie</surname>
          </string-name>
          , Jr.
          <article-title>Query processing in a system for distributed databases (sdd-1)</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .,
          <volume>6</volume>
          (
          <issue>4</issue>
          ):
          <fpage>602</fpage>
          -
          <lpage>625</lpage>
          ,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Eric</given-names>
            <surname>Bonabeau</surname>
          </string-name>
          , Marco Dorigo, and
          <string-name>
            <given-names>Guy</given-names>
            <surname>Theraulaz</surname>
          </string-name>
          .
          <source>Swarm Intelligence: From Natural to Artificial Systems. Santa Fe Institute Studies in the Sciences of Complexity Series</source>
          . Oxford Press,
          <year>July 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Min</given-names>
            <surname>Cai</surname>
          </string-name>
          .
          <article-title>RDFPeers: A scalable distributed RDF repository based on A structured peerto-peer network</article-title>
          .
          <source>Technical report</source>
          , University of Southern California, Computer Science Department, April 02
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Marko</given-names>
            <surname>Harasic</surname>
          </string-name>
          , Anne Augustin, Philipp Obermeier, and Robert Tolksdorf.
          <article-title>RDFSwarms - selforganized distributed RDF triple store</article-title>
          .
          <source>In Proceedings of the 2010 ACM Symposium on applied computing</source>
          ,
          <source>ACM SAC</source>
          <year>2010</year>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>George</given-names>
            <surname>Kokkinidis</surname>
          </string-name>
          , Lefteris Sidirourgos, and
          <string-name>
            <given-names>Vassilis</given-names>
            <surname>Christophides</surname>
          </string-name>
          .
          <source>Query processing in RDF/S-based P2P database systems</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Donald</given-names>
            <surname>Kossmann</surname>
          </string-name>
          .
          <article-title>The state of the art in distributed query processing</article-title>
          .
          <source>ACM Comput. Surv</source>
          ,
          <volume>32</volume>
          (
          <issue>4</issue>
          ):
          <fpage>422</fpage>
          -
          <lpage>469</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Ronaldo</given-names>
            <surname>Menezes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Tolksdorf</surname>
          </string-name>
          .
          <article-title>A new approach to scalable linda-systems based on swarms</article-title>
          .
          <source>In Proceedings of ACM SAC</source>
          <year>2003</year>
          , pages
          <fpage>375</fpage>
          -
          <lpage>379</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Hannes Mu¨hleisen, Kia Teymourian, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Tolksdorf</surname>
          </string-name>
          .
          <article-title>A swarm-based semantic storage service</article-title>
          .
          <source>Poster Session at the 7th Extended Semantic Web Conference, ESWC10</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Vassilis</given-names>
            <surname>Papadimos</surname>
          </string-name>
          , David Maier,
          <string-name>
            <given-names>and Kristin</given-names>
            <surname>Tufte</surname>
          </string-name>
          .
          <article-title>Distributed query processing and catalogs for peer-to-peer systems</article-title>
          .
          <source>In CIDR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Bastian</given-names>
            <surname>Quilitz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ulf</given-names>
            <surname>Leser</surname>
          </string-name>
          .
          <article-title>Querying distributed RDF data sources with SPARQL</article-title>
          .
          <source>In ESWC</source>
          <year>2008</year>
          ,
          <article-title>Proceedings</article-title>
          , volume
          <volume>5021</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>524</fpage>
          -
          <lpage>538</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>