Query Processing in Self-Organized Storage Systems

Query Processing in Self-Organized Storage Systems HannesMühleisen Department of Computer Science -AG Networked Information Systems Freie Universität Berlin

-Königin-Luise-Str. 24/26 14195 Berlin Germany

RobertTolksdorf -Phd Department of Computer Science -AG Networked Information Systems Freie Universität Berlin

-Königin-Luise-Str. 24/26 14195 Berlin Germany

ResearchPhase Department of Computer Science -AG Networked Information Systems Freie Universität Berlin

-Königin-Luise-Str. 24/26 14195 Berlin Germany

Query Processing in Self-Organized Storage Systems 0146517B4EC3376EAFA0426AA909AC74 GROBID - A machine learning software for extracting information from scholarly documents

Storage systems are increasingly approaching their limits regarding system response to node overload and failure as well as overall scalability. Selforganized systems can be a solution to those issues. However, query processing research has not yet evolved to this area. This research proposal aims at extending distributed query processing to self-organized systems. The different components are discussed, and evaluation criteria for a emerging solution are laid out.

Problem statement

Database management systems and storage systems in general have evolved over time to not only efficiently and securely store data on one computer, but also to store this data in a organized network of many computers, with different sharing architectures. Once data can be stored and -crudely -accessed, the need for more sophisticated query patterns arises quickly. These are then formulated using a variety of mostly declarative query languages. Well-known examples for such languages include SQL for relational databases, and SPARQL for RDF storage systems. While evaluating these queries on a single computer represents a significant challenge, performing the same task in a distributed environment is even more complex.

One central component of any storage system supporting complex query patterns is the Query Processor (QP), which takes the declarative complex queries issued by the user and transforms them over various steps into an Query Execution Plan (QEP). The QEP then contains a detailed plan how to access, reorder, and deliver the requested data to the client. This component relies on a map or other lookup mechanism to determine which data elements are stored on which node of the storage network [6,10,5]. If no such map or efficient lookup mechanism is available, query execution is severely impaired. The node handling the query evaluation can no longer efficiently compute a QEP due to a lack of knowledge of the topology of the data stored in the network. While the Peer-to-Peer (P2P) community has developed the concept of mutable query plans, these approaches are only viable in a structured P2P system so far [9,3,1].

A different approach to distributed storage systems are tuple-based selforganizing storage systems using swarm algorithms. Starting from an approach implementing the Linda coordination language, this method makes use of natureinspired algorithms to efficiently route storage and retrieval requests in an unstructured network consisting of a potentially large amount of nodes. Systems built with these algorithms have the potential to scale without inherent limits, and are thus very interesting for further research. The lookup facility in this system is currently limited to simple key-value requests [7]. The RDF data storage concept with its unified triple model makes tuple storage systems interesting for a large number of applications, and research is progressing on the subject [4]. However, complex query processing has not been addressed at this point, due to the unstructured and undefined nature of the storage networks involved.

The research problem will therefore be the design, implementation and evaluation of a efficient query processor for a self-organized storage system. If these systems would contain such a component, it would enable them to replace systems using conventional methods while improving the scalability of the storage system. Significant adaptations of the methods previously used to achieve distributed query processing are expected to be required in the scope of this work, but the overall established structure can be maintained. It is however not yet clear, if there are efficient solutions to this issue at all.

The remainder of this paper is structured as follows: Section 2 formulates the research questions and clarifies relevant assumptions, 3 shows the approach followed in order to identify a solution, which is presented in Section 4. Section 5 shows the steps to be taken in order to evaluate the solution. Finally, Section 6 concludes this paper.

Main questions of the thesis

Given a self-organized storage system with no global data structures as they are present in other self-organizing tuple storage systems [3], the main challenge of this work is to determine whether it is possible to efficiently evaluate complex data queries for those systems. An example for self-organizing systems without global data structures are swarm-based approaches [2]. The ongoing implementation efforts for storage systems using these approaches are currently only considering simple lookup techniques [4,8], thus the planned work will yield considerable progress for this area of research.

The combination of a swarm-based self-organizing tuple storage system with a efficient query processing system could provide a big improvement for tuple storage systems such as RDF stores. With its conceptually unlimited scalability, the combination is able to handle the enormous amount of data generated by Future Internet applications and users while still being able to retrieve any information using complex query languages. Compared to structured Peer-to-Peer approaches, swarm-based systems are better able to cope with ever-changing network structure and load shifts.

A number of prerequisites are assumed to be fulfilled, in order to focus entirely on query processing. It is assumed, that there exists a system, which can be run on a large number of connected computer systems (nodes). This system will form an overlay network consisting of a neighborhood of known nodes for each node. The entirety of nodes is able to store and locate tuples of arbitrary length. Client computer systems can connect to any node using a specialized client API. Clients can write tuples into the storage network by specifying them explicitly. Clients can read tuples from the storage network by specifying fixed parts of tuples (templates). All tuples matching the fixed parts given in the template are then returned. A single node does not recognize more than its immediate neighbor nodes. There is no method to retrieve all tuples stored in the network. While retrieving data for a specific key, nodes can give an approximation, which neighbor node may store or lead to matching tuples. Issues regarding the handling of single key lookup operations or data storage are not within the scope of the planned work.

General approach

In accordance with previous work in the area of distributed query processing, the structure defined in that area is re-used in order to make the different approaches comparable. The following components involved in query processing [6] have been determined to be affected:

1. Query Optimizer and Plan Refinement Using a cost model considering either calculated values from static methods, general query heuristics, or statistical accumulation of knowledge, the query optimizer selects a execution plan for the query. In a system without global knowledge, it may not be possible to converge on the optimal plan immediately. Instead, approaches like mutable query plans may be employed, where the plan is adapted and refined during actual execution [1,9]. 2. Catalog The catalog maintains a collection of metadata describing the distribution and organization of data, their indicies, cardinalities, and further information all intended to assist the other components in performing their tasks. For a distributed system, the catalog may also be distributed. For a distributed relational database, this catalog defines all relations stored in the system, for an RDF store with its unified data model this is not required. In our case, this information is not available to the single nodes, hence only heuristical and statistical methods are applicable to support the other components.

The listed components have to be adapted fundamentally in order to function in a self-organized environment. The remaining components are expected to remain largely untouched. The new query processing system will then be compared with the current key-only lookup technique using a performance analysis. A significant statistical improvement over naive approaches is expected for complex queries.

Proposed solution

In order to build a complex query processing system on top of a self-organized storage system, the base functionality of that system, in particular its ability to retrieve data using a single lookup key, has to be verified first. For any tuple stored in the system, the success rate for read operations requesting that tuple has to be shown to be high enough to support reliable operations. In a second step, existing query processing paradigms will be researched and adapted for our self-organizing system. Considerable amount of theoretical work will precede the implementation of one or various possible solutions on top of our existing self-organizing storage system. The last step will be a comparison between the new solution and the naive approach as well as existing solutions, where applicable.

Evaluation

As we have shown, static methods for query processing may not be applicable in our case. Hence, in order to evaluate our query processor, the emerging prototype will be integrated into a self-organized storage system and tested against a naive implementation. To make results comparable, an traditional distributed query processing engine will also be tested against our approach while paying attention to the design goals for self-organized systems. A test data set as close as possible to a real-world setup will be chosen and stored in the storage network. A set of complex queries involving this data set will also be chosen and executed. The average time to complete each query in multiple runs from different nodes will be measured, and the average results compared.

Future work

The work described in this paper will continue with an exhaustive screening of research regarding query processing in distributed environments. Using this information, the road map to identify a set of possible solutions to the issue of efficient query evaluation in an unstructured distributed storage systems will be set up. After first phase of theoretical calculations and simulations on the set of possible solutions, the most promising solutions will be selected for implementation into a existing self-organized distributed storage system. Benchmarking runs and comparison with naive approaches such as flooding and random walk will show the environment where the selected solutions yield an improvement. It is hoped, that one of the selected solutions will cover a significant number of environments. This solution will then be refined and evaluated furthermore.

Acknowledgments This work has been supported by the "DigiPolis" project funded by the German Federal Ministry of Education and Research (BMBF), support code 03WKP07B.

Query processing in a system for distributed databases (sdd-1) PhilipABernstein NathanGoodman EugeneWong ChristopherLReeve JamesBRothnieJr ACM Trans. Database Syst 6 4 1981 Swarm Intelligence: From Natural to Artificial Systems EricBonabeau MarcoDorigo GuyTheraulaz Santa Fe Institute Studies in the Sciences of Complexity Series Oxford Press July 1999 RDFPeers: A scalable distributed RDF repository based on A structured peerto-peer network MinCai April 02 2008 University of Southern California, Computer Science Department Technical report RDFSwarmsselforganized distributed RDF triple store MarkoHarasic AnneAugustin PhilippObermeier RobertTolksdorf Proceedings of the 2010 ACM Symposium on applied computing the 2010 ACM Symposium on applied computing ACM SAC 2010. 2010 Query processing in RDF/S-based P2P database systems GeorgeKokkinidis LefterisSidirourgos VassilisChristophides 2008 The state of the art in distributed query processing DonaldKossmann ACM Comput. Surv 32 4 2000 A new approach to scalable linda-systems based on swarms RonaldoMenezes RobertTolksdorf Proceedings of ACM SAC 2003 ACM SAC 2003 2003 A swarm-based semantic storage service KiaHannes Mühleisen RobertTeymourian Tolksdorf Poster Session at the 7th Extended Semantic Web Conference

ESWC10

2010 Distributed query processing and catalogs for peer-to-peer systems DavidVassilis Papadimos KristinMaier Tufte CIDR 2003 Querying distributed RDF data sources with SPARQL BastianQuilitz UlfLeser ESWC 2008, Proceedings Lecture Notes in Computer Science Springer 2008 5021