13 Grid Services for Efficient Decentralized Indexation and Query Execution on Distributed Data Warehouses* Pascal Wehrle, Anne Tchounikine, Maryvonne Miquel Lyon Research Center for Images and Intelligent Information Systems (LIRIS) UMR CNRS 5205 – INSA Lyon, 20 avenue Albert Einstein, 69621 Villeurbanne Cedex, France {pascal.wehrle, anne.tchounikine, maryvonne.miquel}@insa-lyon.fr Abstract. Online analysis of large volumes of data consists an important part of today's business and scientific applications. Storing structured data in data warehouses following a multidimensional storage model provides efficient and reliable access to it. Distributed systems have become the first choice to cope with the increasing volume and complexity of data warehouses. In order keep up with this trend, large quantities of interconnected computing and storage resources are now being integrated into computing grids. Adapting data warehouse concepts to grids however requires decentralized management of their traditionally centralized storage and query execution model. The grid services architecture we present in this paper therefore uses a unique data identification to implement grid-wide data localization based on local indexes in combination with a catalog service. These structures integrate information on both materialized and computable data for cost-optimized distributed query execution. 1 Introduction Modern business intelligence and analysis applications heavily rely on data warehouses to efficiently store large volumes of structured data and provide a fast and reliable access to it. As new applications are developed in the domain of data warehouse-based analysis, data warehouses are often required to deliver data to a variety of different user groups, each requesting remote access to specific data sets at custom levels of detail. The classical architectural solution to these requirements is a central data warehouse with both storage and computing of analysis tasks performed by one single machine with remote access for all users. A common solution to overcome storage and computing capacity shortages is delegation of both tasks to distributed systems offering parallelized computing on distributed data. Clusters with centralized control and access provide high performance and availability. However, given high costs and varying load, most companies and organizations prefer to dynamically integrate existing resources from several distant sites to build distributed data warehouses. This requires systems that efficiently organize these resources in a transparent way. Computing grids implementing this approach are based on the collaboration of *This work is supported by the French Ministry for Research ACI “Masse de données” project (http://acimd.labri.fr/) 14 independent grid services, the existing mechanisms for organizing and operating distributed data warehouses cannot be straightly applied. The main challenges are the internal management of the data distribution among grid nodes, starting from an initial distribution, and the subsequent query execution requiring active placement of both data and computing jobs on the grid. Bioinformatics research is one of the application domains where grids are being introduced to analyse huge amounts of data using complex algorithms. The grid services architecture presented in this paper has been developed in the context of the 1 Grid for Geno-Medicine (GGM) project [1], an effort to provide a software architecture capable of managing heterogeneous and dynamic bio-medical data stored on a grid infrastructure for intensive analysis and processing purposes. The rest of this paper is organized as follows: In section 2 we give an overview of the grid service layers we developed for grid-wide querying based on local indexes of stored data. Their interaction during cost-optimized distributed query execution are presented in section 3. Section 4 concludes with an overview of our future work on the system. 2 Proposed Grid Service Layers The proposed grid services of our architecture are implemented on the middleware infrastructure provided by the Globus Toolkit 4.0 [2]. As illustrated in figure 1 these services are grouped into three distinct layers above the standard Globus data access and job execution services. The initial distribution of warehouse data is made using derived horizontal partitioning methods introduced in [3] and [4], including materialized aggregated data. For efficient data localization and exchange we use an identification method that creates one namespace for all warehouse data distributed on grid storage nodes. A detailed description of the approach is given in [5]. We introduced chunk identifiers for warehouse data that refer to all replicas of a tuple of warehouse data on the grid. Each contiguous partition of data is stored in a separate file and identified as a block of chunks between lower and an upper limit chunks. The blocks of chunks stored on a grid node are indexed by the “Local Indexing Service“(LIS), the base layer of our architecture. It offers a complete index of locally available data, both materialized and computable from materialized data. In order to achieve this the LIS builds a lattice structure that groups together data of the same 1 http://liris.cnrs.fr/projets/ggm/GGM_English 15 aggregation level and uses a spatial X-tree index introduced in [6] to efficiently index blocks of data based on the multidimensional warehouse data model. Whenever files are materialized or deleted on the storage node, the index recursively updates all impacted computable data blocks on the index. The information provided by local indexes is assembled, amended and exchanged by the communication and monitoring services layer that provide efficient localization of requested warehouse data and up-to-date monitoring information. External interactions with this layer are executed through the “Chunk Resolution Service”(CRS). It is deployed on every storage node and executes localization queries for warehouse data blocks. In order to make every block of indexed data on the grid available through the network of CRS instances, each CRS publishes the identifiers of locally available data blocks to a catalog service, the “Chunk Localization Array Catalog”(CLAC). The CLAC stores a mapping between data block identifiers and a list of grid nodes from which the data block can be obtained. During query execution, the list of potential sources for a requested data block has to be amended with estimates of transfer and computing costs to make optimization choices possible. The “Network Distance Service”(NDS) [7] provides these current cost estimates based on network bandwidth, latency and CPU load of source and destination nodes. Optimization itself is controlled by the “Query Execution Management Service”(QEMS). Deployed on nodes designated as access points for users, it handles user query transformation, optimization and execution as detailed in the next section. 3 Query Execution Procedure The interaction between the different grid services is best described following the processing of a user query. We distinguish three main phases in this process: The transformation and localization phase consists in transforming the user query and searching the requested data using the communication and monitoring layer services. The following planning phase utilizes the results to optimize the transfer and computation of results. At the end stands the execution phase with concurrent execution of the established distributed query plan. First, the QEMS transforms the initial SQL user query into data block identifiers using the dimensions of the multidimensional warehouse model. A localization query is then sent to the nearest storage node's CRS which controls the execution of the localization process. The query is forwarded to the LIS on the same node which returns the list of locally materialized or computable blocks that satisfy at least part of the query. Missing parts of the requested data are then searched for by transmitting requests to the CLAC. Upon receiving the result list of source nodes for the requested data, the CRS tries to obtain direct references to the source data files containing the data blocks by polling the designated source nodes. At the same time it requests monitoring data from the NDS for each of them. The final results of the localization phase are returned to the QEMS. During the subsequent optimization phase, the QEMS uses a cost metric inspired by the ones presented in [8]. When selecting a replica for a requested block r, it uses the weighted cost c(f,r) for any source file f, which is proportional to the total cost of retrieval tc(f) and the set of useful chunks uc(f,r) contained in f that are also part of r: tc  f  c  f , r= , uc  f , r = f r (1) uc  f , r 16 All source files are next inserted into a list sorted by their weighted cost. A greedy algorithm then computes a local minimum solution by iteratively selecting the file with the least cost and updating the cost values for the remainder of the list to avoid redundant transfers of the same data and to favor retrieval from multiple sources. The resulting distributed query plan contains independent jobs for each source file. During the final execution phase, the QEMS launches file transfers and submits computing jobs according to the distributed execution plan. Jobs involving additional computing of aggregates are given a higher priority over simple data transfers. The query result is progressively assembled through a union operation on the partial results provided by each job and transferred to the client application. 4 Conclusion In this paper we have given an overview of the layered grid services architecture designed to allow efficient decentralized management of distributed data warehouses on grid infrastructures. Based on local indexes of both materialized and computable data on storage nodes our identification method provides grid-wide localization of requested data through a set of communication and monitoring services. The gathered information is used for cost-optimized distributed query execution. The architecture was implemented and tested on the GGM project's grid testbed. Our current efforts are focused on the integration of a collaborative cache system interacting with local indexes to better manage data replication and temporary storage. We are also working on a suitable decentralized maintenance mechanism for the distributed data warehouse on the grid. References [1] Brunie, L., Miquel, M., Pierson, J., Tchounikine, A., Dhaenens, C., Melab, N., Talbi, E., Hameurlain, A., Morvan, F.: Information grids: managing and mining semantic data in a grid infrastructure; open issues and application to geno-medical data. In 14th International Workshop on Database and Expert Systems Applications (DEXA'03), September 1-5 2003, Prague, Czech Republic. IEEE CS, pp. 509-516 [2] Foster, I.T.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In IFIP International Conference on Network and Parallel Computing, NPC 2005, November 30 - December 3 2005, Beijing, China. Springer, pp. 2-13 [3] Özsu, M.T., Valduriez, P.: Principles of distributed database systems. Prentice Hall, 1991. [4] Bellatreche, L., Karlapalem, K., Mohania, M.K.: OLAP Query Processing for Partitioned Data Warehouses. In 1999 International Symposium on Database Applications in Non-Traditional Environments (DANTE '99), 28-30 November 1999, Kyoto, Japan. IEEE CS, pp. 35-42 [5] Wehrle, P., Miquel, M., Tchounikine, A.: A Model for Distributing and Querying a Data Warehouse on a Computing Grid. In 11th International Conference on Parallel and Distributed Systems (ICPADS 2005), 20-22 July 2005, Fukuoka, Japan. IEEE CS, pp. 203-209 [6] Berchtold, S., Keim, D.A., Kriegel, H.: The X-tree : An Index Structure for High-Dimensional Data. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96), September 3-6 1996, Mumbai (Bombay), India. Morgan Kaufmann Publishers/Elsevier Science, pp. 28-39 [7] Gossa, J., Pierson, J., Brunie, L.: Evaluation of network distances properties by NDS, the Network Distance Service. In Proceedings of Third International Workshop on Networks for Grid Applications (GridNets 2006), October 1-2, 2006, San Jose, California, U.S.A. IEEE CS. [8] Weng, L., Çatalyürek, Ü.V., Kurç, T.M., Agrawal, G., Saltz, J.H.: Servicing range queries on multidimensional datasets with partial replicas. In 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 9-12 May, 2005, Cardiff, UK. IEEE CS, pp. 726-733