Query Processing in Self-Organized Storage Systems

                                   Hannes Mühleisen
                  Supervisor: Robert Tolksdorf - PhD Research Phase 1
         Department of Computer Science - AG Networked Information Systems
        Freie Universität Berlin - Königin-Luise-Str. 24/26, 14195 Berlin, Germany
                                  muehleis@inf.fu-berlin.de


      Abstract. Storage systems are increasingly approaching their limits regarding
      system response to node overload and failure as well as overall scalability. Self-
      organized systems can be a solution to those issues. However, query processing
      research has not yet evolved to this area. This research proposal aims at extending
      distributed query processing to self-organized systems. The different components
      are discussed, and evaluation criteria for a emerging solution are laid out.


1   Problem statement

Database management systems and storage systems in general have evolved
over time to not only efficiently and securely store data on one computer, but
also to store this data in a organized network of many computers, with different
sharing architectures. Once data can be stored and - crudely - accessed, the need
for more sophisticated query patterns arises quickly. These are then formulated
using a variety of mostly declarative query languages. Well-known examples
for such languages include SQL for relational databases, and SPARQL for RDF
storage systems. While evaluating these queries on a single computer represents
a significant challenge, performing the same task in a distributed environment
is even more complex.
    One central component of any storage system supporting complex query
patterns is the Query Processor (QP), which takes the declarative complex queries
issued by the user and transforms them over various steps into an Query Execu-
tion Plan (QEP). The QEP then contains a detailed plan how to access, reorder,
and deliver the requested data to the client. This component relies on a map or
other lookup mechanism to determine which data elements are stored on which
node of the storage network [6, 10, 5]. If no such map or efficient lookup mech-
anism is available, query execution is severely impaired. The node handling
the query evaluation can no longer efficiently compute a QEP due to a lack of
knowledge of the topology of the data stored in the network. While the Peer-to-
Peer (P2P) community has developed the concept of mutable query plans, these
approaches are only viable in a structured P2P system so far [9, 3, 1].
2      Hannes Mühleisen

    A different approach to distributed storage systems are tuple-based self-
organizing storage systems using swarm algorithms. Starting from an approach
implementing the Linda coordination language, this method makes use of nature-
inspired algorithms to efficiently route storage and retrieval requests in an un-
structured network consisting of a potentially large amount of nodes. Systems
built with these algorithms have the potential to scale without inherent limits,
and are thus very interesting for further research. The lookup facility in this sys-
tem is currently limited to simple key-value requests [7]. The RDF data storage
concept with its unified triple model makes tuple storage systems interesting for
a large number of applications, and research is progressing on the subject [4].
However, complex query processing has not been addressed at this point, due to
the unstructured and undefined nature of the storage networks involved.
    The research problem will therefore be the design, implementation and eval-
uation of a efficient query processor for a self-organized storage system. If these
systems would contain such a component, it would enable them to replace sys-
tems using conventional methods while improving the scalability of the storage
system. Significant adaptations of the methods previously used to achieve dis-
tributed query processing are expected to be required in the scope of this work,
but the overall established structure can be maintained. It is however not yet
clear, if there are efficient solutions to this issue at all.
    The remainder of this paper is structured as follows: Section 2 formulates
the research questions and clarifies relevant assumptions, 3 shows the approach
followed in order to identify a solution, which is presented in Section 4. Section
5 shows the steps to be taken in order to evaluate the solution. Finally, Section
6 concludes this paper.


2   Main questions of the thesis

Given a self-organized storage system with no global data structures as they are
present in other self-organizing tuple storage systems [3], the main challenge
of this work is to determine whether it is possible to efficiently evaluate com-
plex data queries for those systems. An example for self-organizing systems
without global data structures are swarm-based approaches [2]. The ongoing
implementation efforts for storage systems using these approaches are currently
only considering simple lookup techniques [4, 8], thus the planned work will
yield considerable progress for this area of research.
     The combination of a swarm-based self-organizing tuple storage system
with a efficient query processing system could provide a big improvement for
tuple storage systems such as RDF stores. With its conceptually unlimited scala-
bility, the combination is able to handle the enormous amount of data generated
                             Query Processing in Self-Organized Storage Systems   3

by Future Internet applications and users while still being able to retrieve any in-
formation using complex query languages. Compared to structured Peer-to-Peer
approaches, swarm-based systems are better able to cope with ever-changing
network structure and load shifts.
     A number of prerequisites are assumed to be fulfilled, in order to focus en-
tirely on query processing. It is assumed, that there exists a system, which can be
run on a large number of connected computer systems (nodes). This system will
form an overlay network consisting of a neighborhood of known nodes for each
node. The entirety of nodes is able to store and locate tuples of arbitrary length.
Client computer systems can connect to any node using a specialized client API.
Clients can write tuples into the storage network by specifying them explicitly.
Clients can read tuples from the storage network by specifying fixed parts of
tuples (templates). All tuples matching the fixed parts given in the template are
then returned. A single node does not recognize more than its immediate neigh-
bor nodes. There is no method to retrieve all tuples stored in the network. While
retrieving data for a specific key, nodes can give an approximation, which neigh-
bor node may store or lead to matching tuples. Issues regarding the handling of
single key lookup operations or data storage are not within the scope of the
planned work.


3   General approach

In accordance with previous work in the area of distributed query processing, the
structure defined in that area is re-used in order to make the different approaches
comparable. The following components involved in query processing [6] have
been determined to be affected:

1. Query Optimizer and Plan Refinement Using a cost model considering
   either calculated values from static methods, general query heuristics, or
   statistical accumulation of knowledge, the query optimizer selects a execu-
   tion plan for the query. In a system without global knowledge, it may not be
   possible to converge on the optimal plan immediately. Instead, approaches
   like mutable query plans may be employed, where the plan is adapted and
   refined during actual execution [1, 9].
2. Catalog The catalog maintains a collection of metadata describing the dis-
   tribution and organization of data, their indicies, cardinalities, and further
   information all intended to assist the other components in performing their
   tasks. For a distributed system, the catalog may also be distributed. For a
   distributed relational database, this catalog defines all relations stored in the
   system, for an RDF store with its unified data model this is not required. In
4      Hannes Mühleisen

    our case, this information is not available to the single nodes, hence only
    heuristical and statistical methods are applicable to support the other com-
    ponents.
     The listed components have to be adapted fundamentally in order to func-
tion in a self-organized environment. The remaining components are expected to
remain largely untouched. The new query processing system will then be com-
pared with the current key-only lookup technique using a performance analy-
sis. A significant statistical improvement over naive approaches is expected for
complex queries.

4   Proposed solution
In order to build a complex query processing system on top of a self-organized
storage system, the base functionality of that system, in particular its ability to
retrieve data using a single lookup key, has to be verified first. For any tuple
stored in the system, the success rate for read operations requesting that tuple
has to be shown to be high enough to support reliable operations. In a second
step, existing query processing paradigms will be researched and adapted for our
self-organizing system. Considerable amount of theoretical work will precede
the implementation of one or various possible solutions on top of our existing
self-organizing storage system. The last step will be a comparison between the
new solution and the naive approach as well as existing solutions, where appli-
cable.

5   Evaluation
As we have shown, static methods for query processing may not be applicable
in our case. Hence, in order to evaluate our query processor, the emerging pro-
totype will be integrated into a self-organized storage system and tested against
a naive implementation. To make results comparable, an traditional distributed
query processing engine will also be tested against our approach while pay-
ing attention to the design goals for self-organized systems. A test data set as
close as possible to a real-world setup will be chosen and stored in the storage
network. A set of complex queries involving this data set will also be chosen
and executed. The average time to complete each query in multiple runs from
different nodes will be measured, and the average results compared.

6   Future work
The work described in this paper will continue with an exhaustive screening
of research regarding query processing in distributed environments. Using this
                                 Query Processing in Self-Organized Storage Systems          5

information, the road map to identify a set of possible solutions to the issue of
efficient query evaluation in an unstructured distributed storage systems will be
set up. After first phase of theoretical calculations and simulations on the set of
possible solutions, the most promising solutions will be selected for implemen-
tation into a existing self-organized distributed storage system. Benchmarking
runs and comparison with naive approaches such as flooding and random walk
will show the environment where the selected solutions yield an improvement.
It is hoped, that one of the selected solutions will cover a significant number of
environments. This solution will then be refined and evaluated furthermore.

Acknowledgments This work has been supported by the ”DigiPolis” project
funded by the German Federal Ministry of Education and Research (BMBF),
support code 03WKP07B.

References
 1. Philip A. Bernstein, Nathan Goodman, Eugene Wong, Christopher L. Reeve, and James B.
    Rothnie, Jr. Query processing in a system for distributed databases (sdd-1). ACM Trans.
    Database Syst., 6(4):602–625, 1981.
 2. Eric Bonabeau, Marco Dorigo, and Guy Theraulaz. Swarm Intelligence: From Natural to
    Artificial Systems. Santa Fe Institute Studies in the Sciences of Complexity Series. Oxford
    Press, July 1999.
 3. Min Cai. RDFPeers: A scalable distributed RDF repository based on A structured peer-
    to-peer network. Technical report, University of Southern California, Computer Science
    Department, April 02 2008.
 4. Marko Harasic, Anne Augustin, Philipp Obermeier, and Robert Tolksdorf. RDFSwarms -
    selforganized distributed RDF triple store. In Proceedings of the 2010 ACM Symposium on
    applied computing, ACM SAC 2010, 2010.
 5. George Kokkinidis, Lefteris Sidirourgos, and Vassilis Christophides. Query processing in
    RDF/S-based P2P database systems, 2008.
 6. Donald Kossmann. The state of the art in distributed query processing. ACM Comput. Surv,
    32(4):422–469, 2000.
 7. Ronaldo Menezes and Robert Tolksdorf. A new approach to scalable linda-systems based
    on swarms. In Proceedings of ACM SAC 2003, pages 375–379, 2003.
 8. Hannes Mühleisen, Kia Teymourian, and Robert Tolksdorf. A swarm-based semantic storage
    service. Poster Session at the 7th Extended Semantic Web Conference, ESWC10, 2010.
 9. Vassilis Papadimos, David Maier, and Kristin Tufte. Distributed query processing and cata-
    logs for peer-to-peer systems. In CIDR, 2003.
10. Bastian Quilitz and Ulf Leser. Querying distributed RDF data sources with SPARQL. In
    ESWC 2008, Proceedings, volume 5021 of Lecture Notes in Computer Science, pages 524–
    538. Springer, 2008.