A Storage Scheme for Multi-dimensional Databases Using Extendible Array Files

Ekow J. Otoo and Doron Rotem
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, University of California, Berkeley, CA 94720, USA
ekw@hpcrd.lbl.gov, d rotem@lbl.gov

Proceedings of the third Workshop on STDBM, Seoul, Korea, September 11, 2006

Abstract

Large scale scientific datasets are generally modeled as k-dimensional arrays, since this model is amenable to the forms of analysis and visualization of the scientific phenomena often investigated. In recent years, organizations have adopted the use of on-line analytical processing (OLAP) methods and statistical analyses to make strategic business decisions using enterprise data that are modeled as multi-dimensional arrays as well. In both of these domains, the datasets have the propensity to grow gradually, reaching orders of terabytes. However, the storage schemes used for these arrays allocate the array elements in a sequence of consecutive locations according to array mapping functions that map k-dimensional indices one-to-one onto the linear locations. Such schemes limit the extendibility of the array to one dimension only. We present a method of allocating storage for the elements of a dense multidimensional extendible array such that the bounds on the indices of the respective dimensions can be arbitrarily extended without reorganizing previously allocated elements. We give a mapping function F*(), and its inverse F*-1(), for computing the linear address of an array element given its k-dimensional index. The technique adopts the mapping function, for realizing an extendible array with arbitrary extendibility in main memory, to implement such array files. We show how the extendible array file implementation gives an efficient storage scheme for both scientific and OLAP multi-dimensional datasets that are allowed to grow incrementally without incurring the prohibitive costs of reorganization.

1 Introduction

Datasets used in large scale scientific applications are generally modeled as multidimensional arrays and matrices. Matrices are 2-dimensional rectangular arrays of elements (or entries) laid out in rows and columns. Although the logical view of a rectangular array of elements need not be the same as the actual physical storage, the elements of the arrays are mapped into storage locations according to some linearization function of the indices. An array file is simply a file of the elements of an array in which the k-dimensional indices of the elements map into linear consecutive record locations in the file. The mapping of k-dimensional indices to linear addresses may be by a computed access function, an indexing mechanism, or a mixture of both. Operations on these array files involve accessing and manipulating sub-arrays (or array chunks) of one or more of these array files. Such models for manipulating datasets are typical in large scale computing performed by scientific and engineering simulations, climate modeling, high energy and nuclear physics, astrophysics, computational fluid dynamics, scientific visualization, etc.

On-line analytical processing (OLAP) and decision support systems depend highly on efficiently computing statistical summary information from multi-dimensional databases. A view of the dataset as a multi-dimensional array model forms the basis for efficient derivation of summary information. The literature distinguishes OLAP on the relational model of data, called ROLAP, from OLAP on the multi-dimensional model, termed MOLAP.

Consider a dataset that maintains monitored values of temperature at defined locations given by the latitude and longitude at various time-steps. Figure 1 illustrates a simple multi-dimensional view of such sample data as a 3-dimensional array A[4][3][3] with assigned ordinal coordinates. The entries shown in the cells correspond to the linear addresses of the cells as the relative displacements from cell A⟨0, 0, 0⟩. Each cell stores the temperature value (not shown) as the measure. Suppose we wish to maintain information on sales of some enterprise that has stores at different locations. The locations are defined by their spatial coordinates of latitude and longitude, often abbreviated as Lat/Long. The sales information recorded for, say, three consecutive months can be represented by Figure 1, which gives the same view of the data as before but with values for sales as the measures. A multi-dimensional array correctly models the dataset in both domains of scientific applications and MOLAP. Further common characteristics associated with the multi-dimensional model of data under these domains are that:

1. The data incrementally grow over time by appending new data elements. These may reach orders of terabytes, as in data warehousing.

2. The datasets are mainly read-only. However, they may be subject to expansions in the bounds of the dimensions, e.g., as new monitoring locations are enabled or new stores are opened.

3. The number of dimensions (i.e., the rank) of the array may be extended.

4. The array representation can be either dense or sparse.

In the illustration of Figure 1, the array may grow by appending data for new time-steps. When new locations are added, the bounds of the Lat/Long dimensions may be required to be extended. The mapping function depicted in the above figure (represented by the label inside each cell) corresponds to that of conventional array mapping, which allows extendibility in one dimension only, namely the dimension in the mapping scheme that is least varying. We desire an array addressing scheme that allows extensions in both the bounds and the rank of the array file and also efficiently manages sparse arrays. In this paper we address primarily the storage scheme for managing dense extendible arrays efficiently. We discuss how the technique is tailored for handling sparse arrays as well, without detailed experimentation due to space limitations.

Over the years, special file formats have been extensively studied and developed for storing multi-dimensional array files that support sub-array accesses in high performance computing. These include NetCDF [10], HDF5 [7] and the disk resident array (DRA) [11]. DRA is the persistent counterpart of the distributed main memory array structure called Global Array [12]. Except for HDF5, these array files allow for extendibility only in one dimension. We say an array file is extendible if the index range of any dimension can be extended by appending to the storage of previously allocated elements, so that new elements can be addressed from the newly adjoined index range. The address calculation is done without relocating previously allocated elements and without modifying the addressing function.

Let A[N0][N1]...[Nk-1] denote a k-dimensional array, where Nj, 0 ≤ j < k, is the bound on the indices of dimension j. An element, denoted by A⟨i0, i1, ..., ik-1⟩, is referenced by a k-dimensional index ⟨i0, i1, ..., ik-1⟩. Let L = {ℓ0, ℓ1, ..., ℓM-1} be a sequence of consecutive storage locations, where M = ∏_{j=0}^{k-1} Nj. An allocation (or mapping) function F() maps the k-dimensional indices one-to-one onto the sequence of consecutive indices {0, 1, ..., M-1}, i.e., F : N0 × N1 × ... × Nk-1 → {0, 1, ..., M-1}. Given a location index j, the inverse mapping function F-1(j) computes the k-dimensional index ⟨i0, i1, ..., ik-1⟩ that corresponds to j. The significance of inverse mapping functions has not been much appreciated in the past, but they have important uses in computing the addresses of the neighbors of an element given its linear storage address, and also in managing a general k-dimensional sparse array, as opposed to sparse matrices, which are 2-dimensional.

Modern programming languages provide native support for multidimensional arrays. The mapping function F() is normally defined so that elements are allocated either in row-major or column-major order. We refer to arrays whose elements are allocated in this manner as conventional arrays. Conventional arrays limit their growth to only one dimension. We use the term row-major order in a general sense, beyond row-and-column matrices, to mean an order in which the leftmost index of a k-dimensional index is the least varying. This is also sometimes referred to as the lexicographic order. A column-major order refers to an ordering in which the rightmost index varies the slowest.

In conventional k-dimensional arrays, efficient computation of the mapping function is carried out with the aid of a vector that holds k multiplicative coefficients of the respective indices. Consequently, an N-element array in k dimensions can be maintained in O(N) storage locations, and the computation of the linear address of an element, given its k-dimensional coordinate index, is done in time O(k), since this requires k - 1 multiplications and k - 1 additions. We achieve similar efficiency in the management of extendible arrays by maintaining vectors of multiplying coefficients each time the array expands. The computation of the linear address corresponding to a k-dimensional index uses these stored vectors of coefficients. The vectors of multiplicative coefficients maintain records of the history of the expansions and are organized into Axial-Vectors. There is one Axial-Vector for each dimension; it holds entries that we refer to as expansion records. The fields of an expansion record are described later. Consider a k-dimensional array of N elements in which, after some arbitrary extensions, dimension j maintains Ej records of the history of expansion for dimension j. Our approach computes the linear address from the k-dimensional index in time O(k + ∑_{j=0}^{k-1} Ej) using O(k ∑_{j=0}^{k-1} Ej) additional space.

The technique presented in this paper can be adopted by compiler developers for generating mapping functions for dense multidimensional extendible arrays in memory as well. The Standard Template Library (STL) in C++ also provides support for resizable vectors. However, what we desire is a general technique for allowing array extensions without the extensive cost of reallocating storage. The approach we propose may be used in implementing special libraries such as the Global Array (GA) library [12] and the disk resident array (DRA) [11] that manage dense multidimensional arrays.

To handle sparse arrays, we adopt the technique of array chunking [4, 7, 5], where the linear address of a chunk is computed as for a dense array. The generated address forms the key of an array chunk that is used in an index scheme that stores only non-empty chunks. Partial chunks can be further compressed and grouped into physical buckets to maintain efficient storage utilization. A thorough discussion of our approach is in a follow-up paper.

The main contributions of this paper are:

• A definition of a mapping function for extendible arrays that is applicable to both dense in-core arrays and out-of-core arrays. We call the approach the Axial-Vector method. The general principle is not new: the idea was first proposed with the use of an auxiliary array. However, the storage overhead of using an auxiliary array could be prohibitive. The new Axial-Vector method obviates the storage overhead and also gives a new method for computing the linear addresses.

• When extremely large array files are generated and stored, any dimension can still be expanded by appending new array elements without the need to reorganize the already allocated storage, which could be of the order of terabytes.

• The mapping function proposed in this paper incurs very little storage overhead and can be used as a replacement for the access function of array files such as NetCDF and for the data chunks in the HDF5 file format.

• We discuss an extension of the technique to handle sparse extendible arrays for both in-core and out-of-core arrays.

The organization of this paper is as follows. In the next section we present some details of the definition of the mapping function for dense multidimensional extendible arrays in main memory, since the same mapping function is adopted for accessing elements of extendible array files. In section 3 we describe how an array file is implemented. We describe how sparse multidimensional array files are managed in section 4. We give some experimental comparison of the array mapping functions for extendible and conventional arrays in section 5. We conclude in section 6 and give some directions for our future work.

2 The Mapping Function for an Extendible Array

2.1 Addressing Function for a Conventional Array

First we explore some details of how conventional arrays are mapped onto consecutive storage locations in memory. Consider a k-dimensional array A[N0][N1]...[Nk-1], where Nj, 0 ≤ j < k, denotes the bounds of the index ranges of the respective dimensions. Suppose the elements of this array are allocated in the linear consecutive storage locations L = {ℓ0, ℓ1, ..., ℓM-1} in row-major order according to the mapping function F(). In the rest of this paper we will always assume row-major ordering, and for simplicity we will assume that an array element occupies a unit of storage. A unit of storage could be one word of 4 bytes, a double word of 8 bytes, etc. An element A⟨i0, i1, ..., ik-1⟩ is assigned to location ℓq, where q is computed by the mapping function defined as

q = F(⟨i0, i1, ..., ik-1⟩) = i0*C0 + i1*C1 + ... + ik-1*Ck-1, where Cj = ∏_{r=j+1}^{k-1} Nr, 0 ≤ j ≤ k-1, (1)

and A⟨0, 0, ..., 0⟩ is assigned to ℓ0.

In most programming languages, since the bounds of the arrays are known at compilation time, the coefficients C0, C1, ..., Ck-1 are computed and stored during code generation. Consequently, given any k-dimensional index, the computation of the corresponding linear address using Equation 1 takes time O(k). Conversely, suppose we know the linear address q of an array element; the k-dimensional index ⟨i0, i1, ..., ik-1⟩ corresponding to q can be computed by repeated modulus arithmetic with the coefficients Ck-2, Ck-3, ..., C1 in turn, i.e., F-1(q) → ⟨i0, i1, ..., ik-1⟩.

When allocating elements of a dense multidimensional array in a file, the same mapping function is used, where the linear address q gives the displacement relative to the location of the first element in the file, in units of the size of the array elements. The limitation imposed by F is that the array can only be extended along dimension 0, since the evaluation of the function does not involve the bound N0. We can still circumvent the constraint imposed by F by shuffling the order of the indices whenever the array is extended along any dimension. The idea of extending the index range of a dimension is simply to adjoin a block (or a hyperslab) of array elements whose sizes on all dimensions remain the same except for the dimension being extended.

2.2 The Mapping Function for an In-Core Extendible Array

The question of organizing an extendible array in memory, such that the linear addresses can be computed in a manner similar to that of a static array described above, is a long standing one [15]. Some solutions exist for extendible arrays that grow to maintain some predefined shapes [14, 15]. A solution was proposed that uses an auxiliary array [13] to keep track of the information needed to compute the mapping function. In [16], a similar solution was proposed that organized the content of the auxiliary array with a B-Tree. The use of auxiliary arrays can be prohibitively expensive in storage, depending on the pattern of expansions of the array. We present a new approach to organizing the content of the auxiliary array with the use of Axial-Vectors. The idea of using axial-vectors to replace the auxiliary array was introduced in [20], but only for 2-dimensional arrays. The method introduced there maintains the same information as in the auxiliary-array approach and does not generalize easily to k-dimensional arrays. Furthermore, since it requires that information be stored for each index of any dimension, it incurs the same prohibitive cost for certain array shapes. The method introduced in this paper avoids these problems. First we show how an in-core extendible array is organized with the aid of axial-vectors, since the same mapping function is used for array files.
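As a concrete illustration (not code from the paper), the conventional row-major mapping of Equation 1 and its inverse by repeated division and modulus can be sketched as follows; the function names coefficients, F and F_inv are ours, and unit-size elements are assumed.

```python
def coefficients(bounds):
    """C_j = N_{j+1} * N_{j+2} * ... * N_{k-1}, with C_{k-1} = 1 (Equation 1)."""
    k = len(bounds)
    c = [1] * k
    for j in range(k - 2, -1, -1):
        c[j] = c[j + 1] * bounds[j + 1]
    return c

def F(index, coeffs):
    """Row-major linear address of <i0, ..., ik-1>: k-1 multiplications, k-1 additions."""
    return sum(i * c for i, c in zip(index, coeffs))

def F_inv(q, coeffs):
    """Recover the k-dimensional index from a linear address q by repeated div/mod."""
    index = []
    for c in coeffs:
        index.append(q // c)
        q %= c
    return index
```

For the array A[4][3][3] of Figure 1 the coefficients are [9, 3, 1], so F(⟨1, 2, 0⟩) = 15 and F_inv(15, ...) recovers ⟨1, 2, 0⟩.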
2.3 The Axial-Vector Approach for Extendible Arrays

Consider Figure 2a, which shows a map of the storage allocation of a 3-dimensional extendible array. We denote this in general as A[N*0][N*1][N*2], where N*j represents the fact that the bound has the propensity to grow. In this paper we address only the problem of allowing extendibility in the array bounds, but not in its rank. The labels shown in the array cells represent the linear addresses of the respective elements, as displacements from the location of the first element.

Suppose initially the array is allocated as A[4][3][1], where the corresponding axes of Latitude, Longitude and Time have the instantaneous respective bounds of N*0 = 4, N*1 = 3 and N*2 = 1. The array was extended by one time-step followed by another time-step. The sequence of the two consecutive extensions along the same time dimension, although occurring at two different instances, is considered as an uninterrupted extension of the time dimension. Repeated extension of the same dimension, with no intervening extension of a different dimension, is referred to as an uninterrupted extension and is handled by only one expansion record entry in the axial-vector. For example, in Figure 2a the element A⟨2, 1, 0⟩ is assigned to location 7 and the element A⟨3, 1, 2⟩ is assigned to location 34. The array was subsequently extended along the longitude axis by one index, then along the latitude axis by 2 indices, and then along the time axis by one time-step. A hyperslab of array elements can be perceived as an array chunk (a term used in [17, 7]), where all but one of the dimensions of a chunk take the maximum bounds of their respective dimensions.

Consider now a k-dimensional extendible array A[N*0][N*1]...[N*k-1] for which dimension l is extended by λl, so that the index range increases from N*l to N*l + λl. The strategy is to allocate a hyperslab of array elements such that addresses within the hyperslab are computed as displacements from the location of the first element of the hyperslab. Let the first element of a hyperslab of dimension l be denoted by A⟨0, 0, ..., N*l, ..., 0⟩. Address calculation is computed in row-major order as before, except that now dimension l is the least varying dimension in the allocation scheme, while all other dimensions retain their relative order. Denote the location of A⟨0, 0, ..., N*l, ..., 0⟩ as ℓM*l, where M*l = ∏_{r=0}^{k-1} N*r. Then the desired mapping function F*() that computes the address q* of a new element A⟨i0, i1, ..., ik-1⟩ during the allocation is given by:

q* = F*(⟨i0, i1, ..., ik-1⟩) = M*l + (il - N*l)·C*l + ∑_{j=0, j≠l}^{k-1} ij·C*j, where C*l = ∏_{j=0, j≠l}^{k-1} N*j and C*j = ∏_{r=j+1, r≠l}^{k-1} N*r. (2)

We need to retain for dimension l the values of M*l (the location of the first element of the hyperslab), N*l (the first index of the adjoined hyperslab), and C*r, 0 ≤ r < k (the multiplicative coefficients), in some data structure, so that these can be easily retrieved for computing an element's address within the adjoined hyperslab. The axial-vectors, denoted by Γj[Ej], 0 ≤ j < k, and shown in Figure 2b, are used to retain the required information. Ej is the number of stored records for axial-vector Γj. Note that the number of elements in each axial-vector is always less than or equal to the number of indices of the corresponding dimension; it is exactly the number of uninterrupted expansions. In the example of Figure 2b, E0 = 2, E1 = 2, and E2 = 3.

Each expansion record of a dimension is comprised of four fields. For dimension l, the ith entry, denoted by Γl⟨i⟩, consists of Γl⟨i⟩.N*l; Γl⟨i⟩.M*l; Γl⟨i⟩.C[k], the stored multiplying coefficients for computing the displacement values within the hyperslab; and Γl⟨i⟩.Sil, the memory address where the hyperslab is stored. Note, however, that for computing record addresses of array files this last field is not required, since new records are always allocated by appending to the existing array file. In main memory an extendible array may be formed as a collection of disjoint hyperslabs, since a block of memory acquired for each new hyperslab may not necessarily be contiguous to a previously allocated one. Contiguity in memory allocation is only guaranteed for uninterrupted expansion of a dimension, i.e., when the same dimension is repeatedly expanded.

Given a k-dimensional index ⟨i0, i1, ..., ik-1⟩, the main idea in correctly computing the linear address is to determine which of the records Γ0⟨z0⟩, Γ1⟨z1⟩, ..., Γk-1⟨zk-1⟩ has the maximum starting address of its hyperslab. The index zj is given by a modified binary search algorithm that always returns the highest index of the axial-vector whose expansion record has a first index less than or equal to ij.

For example, suppose we desire the linear address of the element A⟨4, 2, 2⟩. We first note that z0 = 1, z1 = 0, and z2 = 1. We then determine that

M*l = max(Γ0⟨1⟩.M*0, Γ1⟨0⟩.M*1, Γ2⟨1⟩.M*2) = max(48, -1, 12), (3)

from which we deduce that M*l = 48, l = 0, and N*l = N*0 = 4. The computation F*(⟨4, 2, 2⟩) = 48 + 12×(4-4) + 3×2 + 1×2 = 48 + 0 + 6 + 2 = 56. The value 56 is the linear address relative to the starting address of 0.
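The worked example above can be reproduced with a short runnable sketch. This is our reading of the scheme, not the authors' code: the names Record and ExtendibleArray are ours, the initial allocation is treated as a single dimension-0 hyperslab, and an uninterrupted expansion simply enlarges the bound covered by the last record, as described in the text.

```python
from bisect import bisect_right
from math import prod

class Record:
    """One expansion record: first index N*, start location M*, coefficients (Eq. 2)."""
    def __init__(self, start, M, Cl, C):
        self.start, self.M, self.Cl, self.C = start, M, Cl, C

class ExtendibleArray:
    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.axial = [[] for _ in bounds]
        self.last_dim = 0
        # Treat the initial allocation as one hyperslab of dimension 0.
        self.axial[0].append(self._record(0, 0, 0))

    def _record(self, l, start, M):
        k = len(self.bounds)
        Cl = prod(self.bounds[j] for j in range(k) if j != l)
        C = [prod(self.bounds[r] for r in range(j + 1, k) if r != l)
             for j in range(k)]
        return Record(start, M, Cl, C)

    def extend(self, l, by):
        if l == self.last_dim and self.axial[l]:
            self.bounds[l] += by      # uninterrupted: reuse the last record
            return
        M = prod(self.bounds)         # elements allocated before this expansion
        start = self.bounds[l]
        self.bounds[l] += by
        self.axial[l].append(self._record(l, start, M))
        self.last_dim = l

    def address(self, idx):
        # Among the applicable records of all dimensions, pick the maximum M*.
        best, best_l = None, -1
        for j, vec in enumerate(self.axial):
            z = bisect_right([r.start for r in vec], idx[j]) - 1
            if z >= 0 and (best is None or vec[z].M > best.M):
                best, best_l = vec[z], j
        r = best
        return r.M + (idx[best_l] - r.start) * r.Cl + sum(
            idx[j] * r.C[j] for j in range(len(idx)) if j != best_l)
```

Running the example's sequence — start with A[4][3][1], extend time twice, longitude by one, latitude by two, time by one — reproduces the addresses 7, 34 and 56 computed above.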
The main characteristics of the extendible array realization are summarized in the following theorem.

Theorem 2.1. Suppose that in a k-dimensional extendible array, dimension j undergoes Ej uninterrupted expansions. Then if E = ∑_{j=0}^{k-1} Ej, the complexity of computing the function F*() for an extendible array using axial-vectors is O(k + k log(k + E) - k log k), using O(kE) worst case space.

Proof. The worst case sizes of the axial-vectors occur if each dimension has the same number of uninterrupted expansions, i.e., Ej = E/k. The evaluation of F*() involves k binary searches of cost log Ej each, followed by k multiplications, k additions and 1 subtraction, giving a total of O(k + k log(1 + E/k)) = O(k + k log(k + E) - k log k). The additional space requirement for the k axial-vectors is O((k + 3) ∑_{j=0}^{k-1} Ej) = O(kE).

Given a linear address q*, it is easy to find an entry in the axial vectors that has the maximum starting address less than or equal to q*, using a modified binary search algorithm as in the address computation. The repeated modulus arithmetic described in section 2.1 is then used to extract the k-dimensional indices. We state the following without a formal proof.

Theorem 2.2. Given a linear address q of an element of an extendible array realized with the aid of k axial-vectors, the k-dimensional index of the element is computable by the function F*-1() in time O(k + log E).

The main results of the extendible array realization are summarized in Table 1.

Method          Address calculation            Inverse addr. calculation   Storage overhead
Conventional    O(k)                           O(k)                        0
Axial-Vectors   O(k + k log(k + E) - k log k)  O(k + log E)                O(kE)

Table 1: Summary of features of extendible array realizations.

3 Managing Multidimensional Extendible Array Files

An array file is simply a persistent storage of the corresponding main memory resident array in a file, augmented with some meta-data information, either as a header in the same file or in a separate file. We will consider array files as formed in pairs: the primary file Fp and the meta-data file Fm. For extremely large files, the content in memory at any time is a subset, or a subarray, of the entire disk resident array file. Scientific applications that process these arrays consider array chunks as the unit of data access from secondary storage. Most high performance computations are executed as parallel programs, either on a cluster of workstations or on massively parallel machines. The model of the data is a large global array from which subarrays are accessed into individual workstations. The subarrays of the individual nodes together constitute tiles of the global array. Actual physical access of array elements is carried out in units of array chunks for both dense and sparse arrays. We discuss this in some detail in the next section.

For the subsequent discussion, we will ignore most of the details of the organization of the array files and also the details of the structure of the array elements. For simplicity, we consider the array files as being composed of fixed size elements, where each element is composed of a fixed number of attributes. Each attribute value has a corresponding ordinal number that serves as the index value. Mapping attribute values to ordinal numbers is easily done for array files. The primary file Fp contains the elements of the multidimensional array, which continuously grows by appending new data elements whenever a dimension is extended. For example, a dimension corresponding to time may be extended by the addition of data from new time-steps. The meta-data file Fm stores the records that correspond to the axial vectors. The contents of the meta-data file Fm are used to reconstruct the memory resident axial vectors. Each expansion of a dimension results in an update of an axial vector, and consequently of the meta-data file as well.

Suppose an application has already constructed the memory resident axial vectors from the meta-data file; then the linear address of an element (i.e., a record) of the array file, given its k-dimensional coordinates, is computed using the mapping function F*(). Essentially, F*() serves as a hash function for the elements of the array file. Conversely, if the linear address q* of an element is given and one desires the neighbor element that lies some units of coordinate distance along some specified dimensions from the current one, such an element can easily be retrieved with the aid of the inverse function F*-1(). First we compute the k-dimensional coordinate values from F*-1(q*), adjust the coordinate values along the specified dimensions, and compute the address of the desired element by F*(). The relevant algorithms for these operations are easily derived from the definitions of the functions presented in the preceding sections. One of the main features of our scheme is that when extremely large array files are generated, each dimension can still be expanded by appending new array elements without the need to reorganize the allocated storage of the files.
Two popular data organization schemes for large scale scientific datasets are NetCDF [10] and HDF5 [7]. NetCDF maintains essentially an array file according to a row-major ordering of the elements of a conventional array. Consequently, the array can only be extended in one dimension. The technique presented in this paper can easily be adopted as the mapping function of the NetCDF storage scheme to allow for arbitrary extensions of the dimensions of the NetCDF file structure, without incurring any additional access cost. HDF5 is a storage scheme that allows array elements to be partitioned into fixed size sub-arrays called data-chunks. A chunk is physically allocated on secondary storage and accessed via a B+-tree index. Chunking allows for extendibility of the array along any dimension and also for the hierarchical organization of the array elements, i.e., the elements of a top level array are allowed to be arrays of a refined higher resolution, and so on. The mapping function introduced here can be used as a replacement of the B+-tree indexing scheme for the HDF5 array chunks. Other applications of the mapping function introduced here include its use for the Global-Arrays [12] data organization and Disk-Resident-Array files [11].

4 Managing Sparse Extendible Array Files

Besides the characteristic that multi-dimensional databases incrementally grow over time, they also have the unkind property of being sparse. Techniques for managing large scale storage of multi-dimensional data have either addressed the sparsity problem of the array model [2, 3, 4, 5, 17, 18] or the extendibility problem [16, 19], but not both simultaneously. The sparsity of a multidimensional array is managed by the array chunking technique. An array chunk is defined as a block of data that contains extents of all dimensions from a global multi-dimensional array. Even for dense arrays, the array chunk constitutes the unit of transfer between main memory and secondary storage.

Figure 3a shows a 3-dimensional array partitioned into chunks using index-intervals of 3. Addressing elements of the array is computed in two levels. The first level address computation gives the chunk address of the element. The second level address is computed as the displacement of the array element within the array chunk. The extendible array addressing method maps the k-dimensional index of each chunk into a Table Map. The table map of the chunks contains pointers to the physical addresses of the data blocks that hold chunks. An array chunk forms the unit of data transfer between secondary storage and main memory. A table map pointer is set to null if the chunk holds no array elements. As in extendible hashing, the use of a table map to address chunks guarantees at most 2 disk accesses to locate a chunk. Array chunks that have less than some defined number of array elements can be compressed further.

The table map is one of the simplest techniques for handling sparse extendible arrays, but it suffers from the problem of potential exponential growth for high dimensional datasets. Instead of a table map for maintaining array chunks, we use a dynamically constructed PATRICIA trie [9] to manage the table map of the chunks. Other methods of handling sparse multi-dimensional arrays have been described in [4, 8]. Some of the techniques for handling sparse matrices [1] can also be adopted.

4.1 Alternative Methods for Addressing Array Chunks

The use of a table map for locating the physical locations of array chunks relaxes the need for directly addressing array elements with an extendible array mapping function. One can further relax this constraint by doing away with the extendible array mapping function entirely. Rather, all that is required is a method for constructing a unique address I⟨i′0, ..., i′k-1⟩ of an array chunk from the k-dimensional chunk index ⟨i′0, ..., i′k-1⟩. One such method is given by concatenating the binary representations of the coordinate indices of an array chunk. The unique address generated is then used in an index scheme, such as a B+-tree, to locate the physical chunk where an array element is stored. This approach is actually implemented in the HDF5 storage format. There are two problems with this approach:

1. Either the k-dimensional index or the generated identifier for the chunk must be stored with the chunk. For the latter case, an inverse function for computing the k-dimensional chunk index from the chunk identifier is needed, but this is less space consuming.

2. It does not handle memory resident arrays and disk resident arrays uniformly.

In general, the table map can be replaced with any indexing scheme that maintains O(N) chunk addresses for exactly N non-empty chunks. The use of a PATRICIA trie guarantees that. Using the extendible array mapping function for computing the linear address of a chunk has the advantages of:

1. giving us a uniform manner of managing extendible arrays resident both in-core and out-of-core;

2. allowing the replacement of the global k-dimensional address of an array element by one which only defines its linear location within an array chunk, while still enabling us to compute the global k-dimensional index from the linear address.
While multi-dimensional OLAP applications see the efficient computation of the Cube operator [6, 21] as a significant operation, applications in other scientific domains require efficient extraction of sub-arrays for analysis and subsequent visualization. In both domains, efficiently accessing the elements of a sub-array is vital to all computations carried out.

The mapping function provides direct addressing for array chunks and subsequently for the array elements. A naive approach to performing a sub-array extraction operation would be to iterate over the k-dimensional coordinates of the elements to be selected and retrieve each array element independently. Suppose the cardinality of the response set of the first selection class is ℜ. The naive approach performs ℜ independent disk accesses. But one can do better than this worst-case number of disk accesses. An efficient method for processing any of the above queries is to compute the chunk identifiers of the elements and, for each chunk retrieved, extract all elements of the chunk that satisfy the request. Due to space limitations, detailed discussions of the structures, algorithms and experiments on extendible sparse multidimensional arrays are left out of this paper.
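The chunk-at-a-time extraction strategy can be sketched as follows. This is a hypothetical Python fragment, not the paper's implementation; `fetch_chunk` stands in for whatever routine reads one chunk from disk, and the fixed index-interval of 3 is taken from Figure 3a. Grouping the requested box by chunk turns ℜ independent element accesses into one access per overlapping chunk.

```python
from itertools import product
from typing import Callable, Dict, Tuple

CHUNK_INTERVAL = 3

def extract_subarray(lo: Tuple[int, ...], hi: Tuple[int, ...],
                     fetch_chunk: Callable) -> Dict[Tuple[int, ...], float]:
    """Retrieve all elements with lo[d] <= i[d] <= hi[d], one fetch per chunk.

    fetch_chunk(chunk_coord) models a disk read returning a dict that maps
    k-dimensional indices to values for one chunk.
    """
    result: Dict[Tuple[int, ...], float] = {}
    # Ranges of chunk coordinates overlapped by the query box.
    chunk_ranges = [range(l // CHUNK_INTERVAL, h // CHUNK_INTERVAL + 1)
                    for l, h in zip(lo, hi)]
    for chunk in product(*chunk_ranges):   # one "disk access" per chunk
        for idx, val in fetch_chunk(chunk).items():
            if all(l <= i <= h for l, i, h in zip(lo, idx, hi)):
                result[idx] = val
    return result
```

For a box query touching c chunks, this performs c chunk reads rather than ℜ element reads, which is the improvement over the naive worst case described above.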
5 Performance of Extendible Array Files

The theoretical analysis of the mapping function for extendible arrays indicates that it is nearly of the same order of computational complexity as that of conventional arrays. The main difference is the additional time required by the mapping function of an extendible array to perform binary searches in the axial-vectors. We tested this experimentally by computing the average access times of both the conventional array and the extendible array for an array size of approximately 10^8 elements of double data type. We varied the rank of the array from 2 to 8 while keeping the size of the array about the same. We plotted the average time over 10000 random element accesses for static arrays. These experiments were run on a 1.5 GHz AMD Athlon processor running Centos-4 Linux with 2 GBytes of memory. Figure 4a shows the graphs of the access times averaged over 10000 element accesses.
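The flavor of the axial-vector searches can be conveyed with a simplified model. The sketch below is not the paper's F∗() (whose exact definition lies outside this excerpt); it is a minimal allocation scheme in the same spirit, with all names assumed: each expansion appends a new hyperslab without moving any stored element, each dimension keeps an axial vector of (first index, start address, slab shape) records, and resolving an element costs one binary search per dimension, i.e., k searches.

```python
from bisect import bisect_right
from math import prod

class ExtendibleArray:
    """Toy axial-vector allocation: expansions append hyperslabs; no
    previously allocated element ever changes its linear address."""

    def __init__(self, shape):
        self.bounds = list(shape)
        self.next_free = prod(shape)
        # One axial vector per dimension; each record is
        # (first_index, start_address, slab_shape).
        self.axial = [[(0, 0, tuple(shape))] for _ in shape]

    def extend(self, d, delta):
        """Extend dimension d by delta indices without reorganization."""
        slab_shape = tuple(delta if k == d else b
                           for k, b in enumerate(self.bounds))
        self.axial[d].append((self.bounds[d], self.next_free, slab_shape))
        self.next_free += prod(slab_shape)
        self.bounds[d] += delta

    def address(self, index):
        """k binary searches, then a row-major offset in the owning slab."""
        segs = []
        for d, i in enumerate(index):
            firsts = [r[0] for r in self.axial[d]]
            pos = bisect_right(firsts, i) - 1
            segs.append((d,) + self.axial[d][pos])
        # The element lives in the most recently allocated covering slab.
        d, first, start, shape = max(segs, key=lambda s: s[2])
        local = tuple(i - first if k == d else i
                      for k, i in enumerate(index))
        off = 0
        for l, extent in zip(local, shape):
            off = off * extent + l
        return start + off
```

The k calls to `bisect_right` are the source of the rank dependence discussed next: the per-access cost grows with k even though both access functions are O(k).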
The graphs indicate that for static arrays, the cost of computing the extendible array access function varies more significantly with the rank of the array than that of the conventional array access function. Even though the complexities of computing the two access functions are both O(k), the extendible array access function shows a strong dependence on the array's rank k, due to the fact that k binary searches are done in the axial-vectors.

We also considered the impact on the average access cost when the arrays undergo interleaved expansions. The experiment considered arrays of ranks 2 and 3 whose initial size grew from about 10000 elements to about 10^6 elements. For each sequence of about 10000 accesses, the array is allowed to undergo up to 16 expansions. A dimension, selected at random, is extended by a random integer amount of between 1 and 10. Each time the conventional array is extended, its storage allocation is reorganized. The average cost of accessing array elements with interleaved expansions is shown in Figure 4b. Under this model, we find that the average cost of accessing array elements for extendible arrays is significantly less than for conventional arrays, which incur the additional cost of reorganization.
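The reorganization penalty for the conventional layout can be illustrated with a toy cost model; this is an assumed model for intuition, not the paper's experiment. In a row-major file, growing any dimension other than the slowest-varying one changes every linear address and so forces a rewrite of all existing elements, whereas an extendible allocation rewrites nothing because each expansion only appends a hyperslab.

```python
from math import prod

def conventional_reorg_cost(shape, expansions):
    """Count elements a conventional row-major file rewrites under a
    sequence of (dimension, delta) extensions. Simplified model: growing
    any dimension other than the leading (slowest-varying) one relocates
    every existing element; the leading dimension grows in place. The
    corresponding count for an extendible-array allocation is zero."""
    shape = list(shape)
    rewritten = 0
    for d, delta in expansions:
        if d != 0:  # leading dimension grows in place
            rewritten += prod(shape)
        shape[d] += delta
    return rewritten, tuple(shape)
```

For instance, extending the 4x3x3 array of Figure 1 along its second dimension rewrites all 36 elements under this model, while a subsequent extension of the leading dimension rewrites none.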
Similar experiments were conducted for accessing elements of array files instead of memory-resident arrays. Figure 5a compares the average time to access elements for 2, 3 and 4-dimensional static files. There is very little variation between the times of the conventional and extendible array functions. However, when these times are computed with interleaved expansions, the extendible array clearly outperforms the conventional array methods by several orders of magnitude. Figure 5b shows the graphs for the 2, 3 and 4-dimensional extendible array files only. The extra time and storage required to reorganize the conventional array files with interleaved expansions become prohibitive as the array becomes large. One can infer from these results that, in handling large scale dense multi-dimensional datasets that incrementally grow by appending elements, the extendible array mapping function should be the choice for addressing storage.

We should mention that a competitive model for comparisons of multi-dimensional array file implementations should be the HDF5 data schemes. Our implementation currently does not include the caching of array data pages that the HDF5 implementation uses. Work is still ongoing to add main memory buffer pools for data caching, at which time a fair comparison can be made.

6 Conclusion and Future Work

We have shown how a k-dimensional extendible array file can be used to implement multi-dimensional databases. The technique applies to extendible arrays in-core just as much as to out-of-core extendible arrays that can be either dense or sparse. The method relies on a mapping function that uses information retained in axial-vectors to compute the linear storage addresses from the k-dimensional indices. Given the characteristic of multi-dimensional databases that they incrementally grow into terabytes of data, developing a mapping function that does not require reorganization of the array file as the file grows is a desirable feature.

The method proposed is highly appropriate for most scientific datasets, where the model of the data is typically perceived as large global array files. The mapping function developed can be used to enhance current implementations of array files such as NetCDF, HDF5 and Global Arrays. Work is still on-going to incorporate our proposed solution into multidimensional array libraries and to extend the technique to multi-dimensional datasets whose dimensions or ranks are allowed to expand. We are also conducting comparative studies of the different techniques for managing sparse multi-dimensional extendible arrays.

Acknowledgment

This work is supported by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy.

References

[1] J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Sparse Matrix Storage Formats, in Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[2] P. M. Deshpande, K. Ramasamy, A. Shukla, and J. F. Naughton. Caching multidimensional queries using chunks. In Proc. ACM-SIGMOD, pages 259–270, 1998.

[3] P. Furtado and P. Baumann. Storage of multidimensional arrays based on arbitrary tiling. In Proc. of 15th Int'l. Conf. on Data Eng. (ICDE'99), page 480, Los Alamitos, CA, USA, 1999. IEEE Computer Society.

[4] S. Goil and A. N. Choudhary. Sparse data storage schemes for multidimensional data for OLAP and data mining. Technical Report CPDC-TR-9801-005, Center for Parallel and Dist. Comput., Northwestern Univ., Evanston, IL-60208, 1997.

[5] S. Goil and A. N. Choudhary. High performance multidimensional analysis of large datasets. In Int'l. Wkshp on Data Warehousing and OLAP, pages 34–39, 1998.

[6] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1):29–53, 1997.

[7] Hierarchical Data Format (HDF) group. HDF5 User's Guide. National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign, Illinois, release 1.6.3 edition, Nov. 2004.

[8] N. Karayannidis and T. Sellis. Sisyphus: The implementation of a chunk-based storage manager for OLAP data cubes. Data and Knowl. Eng., 45(2):155–180, 2003.

[9] D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, Reading, Mass., 1997.

[10] NetCDF (Network Common Data Form) Home Page. http://my.unidata.ucar.edu/content/software/netcdf/index.html.

[11] J. Nieplocha and I. Foster. Disk resident arrays: An array-oriented I/O library for out-of-core computations. In Proc. IEEE Conf. Frontiers of Massively Parallel Computing (Frontiers '96), pages 196–204, 1996.

[12] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10(2):169–189, 1996.

[13] E. J. Otoo and T. H. Merrett. A storage scheme for extendible arrays. Computing, 31:1–9, 1983.

[14] M. Ouksel and P. Scheuermann. Storage mappings for multidimensional linear dynamic hashing. In Proceedings of the Second ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 90–105, Atlanta, March 1983.

[15] A. L. Rosenberg. Allocating storage for extendible arrays. J. ACM, 21(4):652–670, Oct. 1974.

[16] D. Rotem and J. L. Zhao. Extendible arrays for statistical databases and OLAP applications. In 8th Int'l. Conf. on Sc. and Stat. Database Management (SSDBM '96), pages 108–117, Stockholm, Sweden, 1996.

[17] S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In Proc. 10th Int'l. Conf. Data Eng., pages 328–336, Feb. 1994.

[18] K. E. Seamons and M. Winslett. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc. 7th Int'l. Conf. on Scientific and Statistical Database Management, pages 218–227, Washington, DC, USA, 1994. IEEE Computer Society.

[19] T. Tsuji, A. Isshiki, T. Hochin, and K. Higuchi. An implementation scheme of multidimensional arrays for MOLAP. In DEXA Workshop, pages 773–778, Los Alamitos, CA, USA, 2002. IEEE Computer Society.
[20] T. Tsuji, H. Kawahara, T. Hochin, and K. Higuchi. Sharing extendible arrays in a distributed environment. In IICS '01: Proc. of the Int'l. Workshop on Innovative Internet Comput. Syst., pages 41–52, London, UK, 2001. Springer-Verlag.

[21] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. ACM-SIGMOD Conf., pages 159–170, 1997.

Figure 1: A Lat., Long. and Time 3-D model of the data
Figure 2: An example of the storage allocation scheme of a 3-D extendible array. (a) A storage allocation of a 3-dimensional array varying in both space and time; (b) the corresponding 3 distinct axial-vectors, each entry recording the starting index of the dimension, the starting address of the hyperslab, the vector of multiplying coefficients, and the memory address pointer of the hyperslab.

Figure 3: An allocation scheme of a 3-D sparse extendible array. (a) A 3-dimensional array partitioned into chunks; (b) the chunked array with its table map.

Figure 4: Average time to access an element in an array of type double. (a) Element access costs for a static array; (b) element access costs with interleaved extensions.

Figure 5: Average time to access an element from an extendible array file of type double. (a) Access cost from a static array file; (b) access cost from a file with interleaved extensions.