<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Grid Services for Efficient Decentralized Indexation and Query Execution on Distributed Data Warehouses*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pascal Wehrle</string-name>
          <email>pascal.wehrle@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anne Tchounikine</string-name>
          <email>anne.tchounikine@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maryvonne Miquel</string-name>
          <email>maryvonne.miquel@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>20 avenue Albert Einstein</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>69621 Villeurbanne Cedex</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lyon Research Center for Images and Intelligent Information Systems (LIRIS) UMR CNRS 5205 - INSA Lyon</institution>
        </aff>
      </contrib-group>
      <fpage>13</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Online analysis of large volumes of data constitutes an important part of today's business and scientific applications. Storing structured data in data warehouses following a multidimensional storage model provides efficient and reliable access to it. Distributed systems have become the first choice to cope with the increasing volume and complexity of data warehouses. To keep up with this trend, large quantities of interconnected computing and storage resources are now being integrated into computing grids. Adapting data warehouse concepts to grids, however, requires decentralized management of their traditionally centralized storage and query execution model. The grid services architecture we present in this paper therefore uses a unique data identification scheme to implement grid-wide data localization based on local indexes in combination with a catalog service. These structures integrate information on both materialized and computable data for cost-optimized distributed query execution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Modern business intelligence and analysis applications rely heavily on data
warehouses to efficiently store large volumes of structured data and provide fast and
reliable access to them. As new applications are developed in the domain of data
warehouse-based analysis, data warehouses are often required to deliver data to a
variety of different user groups, each requesting remote access to specific data sets at
custom levels of detail.</p>
      <p>The classical architectural solution to these requirements is a central data warehouse,
with both storage and computation of analysis tasks performed by a single machine
that all users access remotely. A common way to overcome storage and
computing capacity shortages is to delegate both tasks to distributed systems
offering parallelized computing on distributed data. Clusters with centralized control
and access provide high performance and availability. However, given high costs and
varying load, most companies and organizations prefer to dynamically integrate
existing resources from several distant sites to build distributed data warehouses. This
requires systems that organize these resources efficiently and transparently.
Since computing grids implementing this approach are based on the collaboration of
independent grid services, the existing mechanisms for organizing and operating
distributed data warehouses cannot be applied directly. The main challenges are the
internal management of the data distribution among grid nodes, starting from an
initial distribution, and the subsequent query execution requiring active placement of
both data and computing jobs on the grid.</p>
      <p>* This work is supported by the French Ministry for Research ACI “Masse de
données” project (http://acimd.labri.fr/).</p>
      <p>
        Bioinformatics research is one of the application domains where grids are being
introduced to analyse huge amounts of data using complex algorithms. The grid
services architecture presented in this paper has been developed in the context of the
Grid for Geno-Medicine1 (GGM) project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an effort to provide a software
architecture capable of managing heterogeneous and dynamic bio-medical data stored
on a grid infrastructure for intensive analysis and processing purposes.
      </p>
      <p>The rest of this paper is organized as follows: In section 2 we give an overview of
the grid service layers we developed for grid-wide querying based on local indexes of
stored data. Their interaction during cost-optimized distributed query execution is
presented in section 3. Section 4 concludes with an overview of our future work on
the system.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Grid Service Layers</title>
      <p>
        The proposed grid services of our architecture are implemented on the middleware
infrastructure provided by the Globus Toolkit 4.0 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As illustrated in figure 1, these
services are grouped into three distinct layers above the standard Globus data access
and job execution services.
      </p>
      <p>
        The initial distribution of warehouse data is made using derived horizontal
partitioning methods introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], including materialized aggregated
data. For efficient data localization and exchange we use an identification method that
creates one namespace for all warehouse data distributed on grid storage nodes. A
detailed description of the approach is given in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We introduced chunk identifiers
that refer to all replicas of a given tuple of warehouse data on the grid.
Each contiguous partition of data is stored in a separate file and identified as a block
of chunks between a lower and an upper limit chunk.
      </p>
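To make the block-of-chunks identification concrete, here is a minimal sketch; the tuple layout, the `Block` class and the sample coordinates are our own illustrative assumptions, not the paper's actual identifier format:

```python
# Illustrative sketch only: a chunk identifier is modeled as a tuple of
# coordinates, one per warehouse dimension; a block is the hyper-rectangle
# of chunks between a lower and an upper limit chunk.
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    lower: tuple  # lower limit chunk, e.g. (year, region, product)
    upper: tuple  # upper limit chunk

    def contains(self, chunk: tuple) -> bool:
        # A chunk belongs to the block if it lies within the limits
        # on every dimension.
        return all(lo <= c <= hi
                   for lo, c, hi in zip(self.lower, chunk, self.upper))

b = Block(lower=(2005, 1, 10), upper=(2006, 4, 20))
print(b.contains((2005, 2, 15)))  # True: inside the limits on all dimensions
print(b.contains((2007, 2, 15)))  # False: outside the first (time) dimension
```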
      <p>
        The blocks of chunks stored on a grid node are indexed by the “Local Indexing
Service” (LIS), the base layer of our architecture. It offers a complete index of locally
available data, both materialized and computable from materialized data. To
achieve this, the LIS builds a lattice structure that groups together data of the same
aggregation level and uses a spatial X-tree index introduced in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to efficiently index
blocks of data based on the multidimensional warehouse data model. Whenever files
are materialized or deleted on the storage node, the LIS recursively updates all
impacted computable data blocks in the index.
      </p>
      <p>1 http://liris.cnrs.fr/projets/ggm/GGM_English</p>
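A minimal sketch of the LIS lookup idea follows; the aggregation levels and block coordinates are made up, and a plain linear scan stands in for the X-tree spatial index used by the actual service:

```python
# Illustrative sketch: blocks are grouped by aggregation level in a
# lattice-like structure, and a query region is matched against each
# block of the requested level.
from collections import defaultdict

def overlaps(block, region):
    # block and region are (lower, upper) corner pairs of equal dimensionality.
    (blo, bhi), (rlo, rhi) = block, region
    return all(bl <= rh and rl <= bh
               for bl, bh, rl, rh in zip(blo, bhi, rlo, rhi))

class LocalIndex:
    def __init__(self):
        self.levels = defaultdict(list)  # aggregation level -> list of blocks

    def add(self, level, block):
        self.levels[level].append(block)

    def lookup(self, level, region):
        # The real LIS would delegate this range search to an X-tree.
        return [b for b in self.levels[level] if overlaps(b, region)]

lis = LocalIndex()
lis.add(("month", "city"), ((1, 10), (6, 20)))
lis.add(("month", "city"), ((7, 10), (12, 20)))
print(lis.lookup(("month", "city"), ((5, 12), (8, 18))))  # both blocks overlap
```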
      <p>
        The information provided by local indexes is assembled, amended and exchanged
by the communication and monitoring services layer, which provides efficient localization
of requested warehouse data and up-to-date monitoring information. External
interactions with this layer are executed through the “Chunk Resolution
Service” (CRS). It is deployed on every storage node and executes localization queries
for warehouse data blocks. To make every block of indexed data on the grid
available through the network of CRS instances, each CRS publishes the identifiers of
locally available data blocks to a catalog service, the “Chunk Localization Array
Catalog” (CLAC). The CLAC stores a mapping between data block identifiers and a
list of grid nodes from which the data block can be obtained. During query execution,
the list of potential sources for a requested data block has to be amended with
estimates of transfer and computing costs to make optimization choices possible. The
“Network Distance Service” (NDS) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provides these current cost estimates based on
network bandwidth, latency and CPU load of source and destination nodes.
Optimization itself is controlled by the “Query Execution Management
Service” (QEMS). Deployed on nodes designated as access points for users, it handles
user query transformation, optimization and execution as detailed in the next section.
      </p>
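The publish/locate interplay between the CRS instances and the CLAC can be sketched as a simple in-memory catalog; the node names, block identifiers and dictionary layout are illustrative assumptions, not the service's actual data structures:

```python
# Illustrative sketch of the catalog role: a mapping from data block
# identifiers to the set of grid nodes that can serve each block.
from collections import defaultdict

class Catalog:
    def __init__(self):
        self.sources = defaultdict(set)  # block id -> set of node names

    def publish(self, node, block_ids):
        # Called by each node's CRS for its locally indexed blocks.
        for bid in block_ids:
            self.sources[bid].add(node)

    def locate(self, block_id):
        # Returns every node from which the block can be obtained.
        return sorted(self.sources[block_id])

clac = Catalog()
clac.publish("node-a", ["B1", "B2"])
clac.publish("node-b", ["B2", "B3"])
print(clac.locate("B2"))  # ['node-a', 'node-b']
```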
    </sec>
    <sec id="sec-3">
      <title>Query Execution Procedure</title>
      <p>The interaction between the different grid services is best described by following the
processing of a user query. We distinguish three main phases in this process: The
transformation and localization phase consists of transforming the user query and
searching for the requested data using the communication and monitoring layer services.
The following planning phase uses these results to optimize the transfer and
computation of results. The process ends with the execution phase, in which the
established distributed query plan is executed concurrently.</p>
      <p>First, the QEMS transforms the initial SQL user query into data block identifiers
using the dimensions of the multidimensional warehouse model. A localization query
is then sent to the nearest storage node's CRS which controls the execution of the
localization process. The query is forwarded to the LIS on the same node which
returns the list of locally materialized or computable blocks that satisfy at least part of
the query. Missing parts of the requested data are then searched for by transmitting
requests to the CLAC. Upon receiving the result list of source nodes for the requested
data, the CRS tries to obtain direct references to the source data files containing the
data blocks by polling the designated source nodes. At the same time it requests
monitoring data from the NDS for each of them.</p>
      <p>
        The final results of the localization phase are returned to the QEMS. During the
subsequent optimization phase, the QEMS uses a cost metric inspired by the ones
presented in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. When selecting a replica for a requested block r, it uses the weighted
cost c(f, r) for any source file f, which is proportional to the total cost of retrieval tc(f)
and inversely proportional to the number of useful chunks uc(f, r) contained in f that
are also part of r:
      </p>
      <p>c(f, r) = tc(f) / |uc(f, r)|, where uc(f, r) = f ∩ r (1)</p>
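With made-up retrieval costs and chunk sets, the cost metric of equation (1) can be evaluated as follows (the function name and sample values are illustrative, not from the paper):

```python
# Illustrative evaluation of eq. (1): c(f, r) = tc(f) / |uc(f, r)|,
# where uc(f, r) = f ∩ r is the set of useful chunks of file f for request r.
def weighted_cost(tc_f, chunks_f, chunks_r):
    useful = chunks_f & chunks_r              # uc(f, r)
    return tc_f / len(useful) if useful else float("inf")

r = {1, 2, 3, 4}                              # chunks requested by the query
print(weighted_cost(8.0, {1, 2, 3, 4, 5}, r))  # 8 / 4 = 2.0
print(weighted_cost(3.0, {1, 9}, r))           # 3 / 1 = 3.0, only chunk 1 is useful
```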
      <p>All source files are next inserted into a list sorted by their weighted cost. A greedy
algorithm then computes a local minimum solution by iteratively selecting the file
with the least cost and updating the cost values for the remainder of the list to avoid
redundant transfers of the same data and to favor retrieval from multiple sources. The
resulting distributed query plan contains independent jobs for each source file.</p>
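The greedy plan construction can be sketched as follows; this is an assumed reading of the procedure described above, with invented file names, costs and chunk sets:

```python
# Illustrative sketch: repeatedly pick the source file with the least
# weighted cost, then drop the chunks it covers from the remaining request
# so later picks are re-costed and redundant transfers are avoided.
def greedy_plan(files, requested):
    # files: list of (name, total_retrieval_cost, chunk_set)
    remaining = set(requested)
    candidates = list(files)
    plan = []
    while remaining and candidates:
        def cost(f):
            useful = f[2] & remaining         # chunks still missing that f covers
            return f[1] / len(useful) if useful else float("inf")
        best = min(candidates, key=cost)
        if cost(best) == float("inf"):
            break                             # no candidate covers any missing chunk
        plan.append(best[0])
        remaining -= best[2]
        candidates.remove(best)
    return plan, remaining

files = [("f1", 8.0, {1, 2, 3, 4}),
         ("f2", 2.0, {1, 2}),
         ("f3", 3.0, {3, 4, 5})]
plan, missing = greedy_plan(files, {1, 2, 3, 4, 5})
print(plan, missing)  # ['f2', 'f3'] set() -- f1 alone would cost more per chunk
```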
      <p>During the final execution phase, the QEMS launches file transfers and submits
computing jobs according to the distributed execution plan. Jobs involving additional
computing of aggregates are given higher priority than simple data transfers. The
query result is progressively assembled through a union operation on the partial
results provided by each job and transferred to the client application.</p>
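The ordering rule and the final union assembly can be illustrated with a tiny sketch; the job tuples and partial results are invented, not the QEMS's actual scheduling code:

```python
# Illustrative sketch: aggregate-computing jobs are launched before plain
# transfers, and the query result is the union of all partial results.
jobs = [("transfer", "f1", {(1, "a")}),
        ("aggregate", "f2", {(2, "b")}),
        ("transfer", "f3", {(3, "c")})]

# Stable sort: aggregation jobs first, transfers keep their original order.
ordered = sorted(jobs, key=lambda j: 0 if j[0] == "aggregate" else 1)

result = set()
for kind, name, partial in ordered:
    result |= partial                 # union of the partial results

print([name for _, name, _ in ordered])  # ['f2', 'f1', 'f3']
print(len(result))                       # 3 partial tuples assembled
```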
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper we have given an overview of the layered grid services architecture
designed to allow efficient decentralized management of distributed data warehouses
on grid infrastructures. Based on local indexes of both materialized and computable
data on storage nodes, our identification method provides grid-wide localization of
requested data through a set of communication and monitoring services. The gathered
information is used for cost-optimized distributed query execution. The architecture
was implemented and tested on the GGM project's grid testbed. Our current efforts
are focused on the integration of a collaborative cache system interacting with local
indexes to better manage data replication and temporary storage. We are also working
on a suitable decentralized maintenance mechanism for the distributed data
warehouse on the grid.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Brunie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miquel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pierson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tchounikine</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhaenens</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melab</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talbi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hameurlain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morvan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Information grids: managing and mining semantic data in a grid infrastructure; open issues and application to geno-medical data</article-title>
          .
          <source>In 14th International Workshop on Database and Expert Systems Applications (DEXA'03), September 1-5</source>
          <year>2003</year>
          , Prague, Czech Republic.
          <source>IEEE CS</source>
          , pp.
          <fpage>509</fpage>
          -
          <lpage>516</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>I.T.</given-names>
          </string-name>
          :
          <article-title>Globus Toolkit Version 4: Software for Service-Oriented Systems</article-title>
          .
          <source>In IFIP International Conference on Network and Parallel Computing, NPC 2005, November 30 - December 3</source>
          <year>2005</year>
          , Beijing, China. Springer, pp.
          <fpage>2</fpage>
          -
          <lpage>13</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valduriez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Principles of distributed database systems</article-title>
          .
          <source>Prentice Hall</source>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bellatreche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlapalem</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohania</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          :
          <article-title>OLAP Query Processing for Partitioned Data Warehouses</article-title>
          .
          <source>In 1999 International Symposium on Database Applications in Non-Traditional Environments (DANTE '99), 28-30 November 1999</source>
          , Kyoto, Japan. IEEE CS, pp.
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wehrle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miquel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tchounikine</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Model for Distributing and Querying a Data Warehouse on a Computing Grid</article-title>
          .
          <source>In 11th International Conference on Parallel and Distributed Systems (ICPADS 2005), 20-22 July 2005</source>
          , Fukuoka, Japan. IEEE CS, pp.
          <fpage>203</fpage>
          -
          <lpage>209</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Berchtold</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keim</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The X-tree: An Index Structure for High-Dimensional Data</article-title>
          .
          <source>In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96), September 3-6</source>
          <year>1996</year>
          , Mumbai (Bombay), India. Morgan Kaufmann Publishers/Elsevier Science, pp.
          <fpage>28</fpage>
          -
          <lpage>39</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gossa</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pierson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Evaluation of network distances properties by NDS, the Network Distance Service</article-title>
          .
          <source>In Proceedings of the Third International Workshop on Networks for Grid Applications (GridNets 2006), October 1-2, 2006</source>
          , San Jose, California, U.S.A. IEEE CS.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Çatalyürek</surname>
            ,
            <given-names>Ü.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurç</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saltz</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Servicing range queries on multidimensional datasets with partial replicas</article-title>
          .
          <source>In 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 9-12 May 2005</source>
          , Cardiff, UK. IEEE CS, pp.
          <fpage>726</fpage>
          -
          <lpage>733</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>