Structural Repairs of Multidimensional Databases∗ Sina Ariyan and Leopoldo Bertossi † Carleton University, School of Computer Science, Ottawa, Canada. {mariyan,bertossi}@scs.carleton.ca Abstract. In a multidimensional (MD) database, dimensions may be subject to semantic conditions that are not enforced by MD DBMSs or data warehouse ap- plications. Strictness and homogeneity are possibly two of them; and are crucial for the efficiency and correctness of answering MD aggregate queries and up- dating materialized aggregate views. Dimensions may become inconsistent, i.e. non-strict or heterogeneous, as the result of update operations. As a methodol- ogy to restore consistency, we propose and investigate changes to the dimension schema, as an alternative to changes on the dimension instance. We introduce the notion of minimal structural repair, and establish that under certain condi- tions, a structural repair reduces the cost wrt changing the dimension instance. We also show that it allows for a correct rewriting of queries posed to the orig- inal MD model into queries in terms of the new schema. Finally, we show how query-scoped calculated members in MDX can be used to create virtual repairs that simulate structural repairs. 1 Introduction Multidimensional databases (MDDBs) represent data in terms of dimensions and fact tables. Now, a dimension is represented as a dimension schema, i.e. a hierarchy (or lattice) of categories [13, 10], plus a dimension instance that assigns data elements to categories, organizing them in a hierarchy that parallels the category hierarchy. Facts are quantitative measures assigned to dimension instances. This MD organization al- lows users to aggregate data at different levels of granularity. Aggregation is the most important operation in OLAP applications. Performing fast query answering becomes crucial since queries are complex, involve aggregation and also much data [15]. Figure 1 shows the schema and instance of a Location dimension. The categories are: City, State, Province, Country, All. Here, for example, USA, England, UK, Canada are data elements, of the Country category. Categories (and elements therein) are partially ordered upwards. We say, for example, that Country is an ancestor category of City. If an element a is connected to an element b that be- longs to a higher category, we say that a rolls up to b. Fig. 1: Schema(left) and instance(right) of a dimension In an ideal situation, MDDBs are expected to satisfy the strictness (aka. no double counting) and homogeneity (aka. covering) conditions. These become semantic con- ∗ This research was funded by an NSERC Discovery Grant and the NSERC Strategic Network on Business Intelligence (BIN), ADC02 project. † Faculty Fellow of the IBM CAS, Toronto. straints on dimension instances. In a strict dimension, every element of a category rolls up to at most one element in a same ancestor category. In a homogeneous dimension, all elements in a category have the same structure. More precisely, if a category C has C  as an ancestor category, then all elements of C must roll up to some element in C  . The dimension instance in Figure 1 is not homoge- neous, i.e. heterogeneous, because the element London does not roll up to any element in categories State or Province. It is also non-strict, because London rolls up to the two different elements, England and UK, in the Country category. Dimension instances that do not satisfy strictness or homogeneity (or both) are said to be inconsistent. Hence, the dimension in Figure 1 is inconsistent. Dimensions may become inconsistent for several reasons, in particular, as the result of a poor design or a series of update operations. Inconsistent dimensions reduce soundness and efficiency of OLAP systems. In them it is not possible to assume that the summarizability property holds [13, 14]. In a summarizable dimension, an aggregate view defined for a parent cat- egory can be correctly derived from a set of pre-computed views for its child categories. Summarizability allows for the correct reuse of materialized aggregate views. To restore consistency, changes have to be made to the dimension schema and/or the dimension instance. In accordance with the area of consistent query answering (CQA) in relational databases [1, 3], the resulting dimension is called a repair of the original one. A minimal repair is one that minimally differs from the original dimension. As expected, minimality can be defined and characterized in different ways [3]. Previous work on repairing MDDBs has focused mainly on changing the dimension instance by modifying data elements or links between them [4, 5, 6]. Repairs obtained in this way are called data repairs. In them, the proposed approach to removing heterogeneity is the addition of different null values in the dimension instance [14, 18]. As indicated above, inconsistency may be caused by a problematic dimension sche- ma. Therefore, changing the dimension instance may be only a temporary solution. Future update operations will probably produce new inconsistencies. So, exploring and considering structural changes, i.e. changes in the schema, is a natural way to go. This is an idea that has been suggested in [14], to overcome heterogeneity. In this work we introduce and investigate the notion of structural repair, as an alternative to data repair. They modify the schema of an inconsistent dimension, to restore strictness and homogeneity. Structural repairs restrict changes to dimension instances by allowing changes that affect only the category lattice and the distribution of the data elements in different categories. In particular, there are no “data changes” (as in data repairs), in the sense that the data elements and also the edges between them remain. In the case of a ROLAP data warehouse (with relational schemas, like star or snow flake) fewer data changes on a dimension instance directly results in fewer changes on records in the underlying database. We establish that any minimal structural repair enables query rewriting: Given a query, Q, posed to the original, inconsistent MDDB D and any of its minimal repairs, D , Q can be translated into a query Q  to be posed to D  , obtaining the same results in terms of selected data elements and their respective aggregate values. In particular, if in D some summarizability properties applied or “local” homogeneity and strictness held (which, e.g. can be enforced by local semantic constraints [5]), then the query answers for the “correct” cases in D are reobtained via Q  from D  . Even more, the same (correct) summarizability properties that held in D will hold in D  . Since repairs are now consistent, summarizability will be a global property in each of them. 1 Structural repairs can be used to solve both non-strictness and heterogeneity, and have major advantages wrt to other approaches, specially when dealing with hetero- geneity. On the other hand, when fixing non-strictness, structural repairs have good properties wrt the number of changes when the parent elements in a same category in- volved in a violation of strictness (i.e. they have a descendant in common) do not have descendants in many different categories. In such undesirable cases, changing the par- ent category may have to be propagated to categories belonging to its lower subgraph, causing several additional changes (cf. Section 6). Thus, a data repair or a combination of data and structural repair may result in fewer overall changes. Structural repairs may be virtually implemented through appropriate view defini- tions, an idea we develop in this work. As a view definition language, we consider MDX, the language commonly used to query MDDBs [17, 19]. In this work we also show how query-scoped calculated members in the MDX language can be used to create virtual structural repairs. In this way, the behavior of structural repairs can be explored at query time, which is useful if a structural repair of choice is going to be materialized at some point. This paper is organized as follows: Section 2 presents the MD data model. Section 3 formalizes the notion of (minimal) structural repair. Section 4 discusses properties of an ideal repair and shows that, by satisfying certain conditions, a minimal structural repair will have those properties. Section 5 shows how virtual structural repairs can be created in MDX. Section 6 provides a discussion of previous work, a comparison between data repairs and structural repairs, directions for future work, and some concluding remarks. Proofs of results, additional examples, and a brief introduction to MDX can be found in an extended version. 2 2 The Multidimensional Model In this work we adopt as a basis the Hurtado-Mendelzon model of MDDBs [10, 13]. However, since we will deal with changes in the schema, we will use a two-sorted structural representation of a MDDB (as in many-sorted predicate logic [8]). This will flatten out the representation and will allow us to talk about categories as elements of the structure (as opposed to sets of elements). In consequence, a dimension is a set-theoretic structure of the form S = U, C U , E U , LU U U U U C , LE , EC , where C and E are unary relations, and L UC , L U E , EC U are binary relations, all of them over the universe U . More precisely: 1. U is a possibly infinite and non-empty set that is the disjunct union of a set of categories and a set of data elements, say U = U C ∪ UE . 2. C U ⊆ UC is a finite, non-empty set of “active” categories. In the example above, C U = {City, State, Province, Country, All}. 3. E U ⊆ UE is a finite, non-empty set of “active” data elements. In the example, E U = {LA, ..., all}. 4. LUC is the child/parent relationship between categories, i.e. edges between cate- gories in the dimension schema. In the example, L U C ={(City, State), (State, Country), . . . , (Country, All)}. 1 Actually, all these claims still hold under slightly softer conditions than those imposed by the notion of minimal repair. 2 http://people.scs.carleton.ca/∼bertossi/papers/strucRep18Ext.pdf 5. LUE is the child/parent relationship between data elements, i.e. edges between data elements in the dimension instance. In the example, L U E = {(LA,California), (California, USA), . . . , (Canada,all)}. 6. EC U is a relation that maps data elements to their categories. In the example, EC U = {(LA,City), (California, State), . . . , (all,All)}. all is the only “element” of All. 3 From a dimension S as above we can define its dimension schema structure: K(S) := UC , C U , LU U U U C . We can also define its dimension instance: M (S) := U, E , LE , EC . Loosely speaking, we can say that “M (S) is the instance for schema K(S)”. For a structure S to represent a valid MDM, it has to satisfy the following basic MD conditions (BMDCs): (below e i and ci represent members of the universe and R ∗ denotes the transitive closure of a binary relation R): U 1. C U ∩ E U = ∅. 2. LU U U U U C ⊆ C × C . 3. LE ⊆ E × E . 4. EC U ⊆ E U × C U. 5. For all e1 , e2 , c1 , c2 : If (e1 , e2 ) ∈ LE and (e1 , c1 ) ∈ EC and (e2 , c2 ) ∈ EC U, U U then (c1 , c2 ) ∈ LU C. 6. For all c1 : If c1 ∈ C U, then (c1 , c1 ) ∈ / (LU ∗ C) . 7. For all e1 : If e1 ∈ E , then there exists c1 with (e1 , c1 ) ∈ EC U. U 8. For all c1 , c2 : If there is e with (e, c1 ) ∈ EC U and (e, c2 ) ∈ EC U , then c1 = c2 . 9. For all c1 , c2 : c1 = c2 iff, for all e ∈ E U, it holds: (e, c1 ) ∈ EC U iff (e, c2 ) ∈ EC U . Notice that from these conditions it follows that 2. an 3. are proper inclusions. Definition 1. Dimension S = U, C U , E U , LU U U C , LE , EC  is: (a) Strict if, for all e1 , e2 , e3 , c : If (e1 , e2 ) ∈ (LE ) , (e1 , e3 ) ∈ (LE ) , (e2 , c) ∈ EC U , (e3 , c) ∈ EC U , U ∗ U ∗ then e2 = e3 . (b) Homogeneous if, for all e 1 , c1 , c2 : If (e1 , c1 ) ∈ EC U , (c1 , c2 ) ∈ LU C, then there is e2 , with (e2 , c2 ) ∈ EC U and (e1 , e2 ) ∈ LU E . (c) Consistent if it satisfies the two previous conditions; and inconsistent otherwise.  3 Structural Repairs Definition 2. A data repair for an inconsistent dimension S = U, C U , E U , LU U C , LE , U  U , LU , L EC U  is a structure S  = U, C U , E  U , EC   that satisfies the dimension C E 4 conditions and the BMDCs, and is strict and homogeneous.  Notice that S and S  have the same universe, the same categories, and the same category links (but other relations may differ). We can see that, in order to obtain a data repair, one can only add or remove data elements, change links between elements or move elements between different categories. The following notion of structural repair (SR) is given in terms of a new MD structure, possibly with a new schema, and a mapping that establishes the correspondence between the original dimension and the new structure. Definition 3. A structural repair for an inconsistent dimension S = U, C U , E U , U LU U U   U U U U  C , LE , EC  is a pair S , g, with S a structure U, C , E , LC , LE , EC , with the following properties: (a) S  is strict and homogeneous. (b) Elements cannot move 3 We assume that in MDM all dimensions have this top category All with the single element all, to which all the other elements roll up. 4 In the following this will be implicity assumed. between categories that exist both in S and S  : For all e, c1 , c2, if (e, c1 ) ∈ EC U , (e, c2 ) U  , c1 = c2 , then {c1 , c2 }  (C U ∩ C ∈ EC  U ). (c) Any new category in S  must have U at least one data element. (d) g : C U → 2C , the schema mapping, is a total function that maps each category of S to a set of categories in CU (which is finite in S  ). (e) If   c ∈ g(c), then c and c share at least one data element.  The role of g is to establish a relationship between the schemas of S and S  . Notice that, for each category c, the set of its “elements”, i.e. {e | (e, c) ∈ EC U }, may be  U and c ∈ g(c)}). ⊆-incomparable with {e | (e, c  ) ∈ EC Fig. 2: An SR for the dimension of Figure 1 Fig. 3: Another SR for the dimension of Figure 1 A structural repair has the same domain, including categories and elements, as the initial dimension. However, the finite sets of active categories, i.e. C U , CU , resp., may be different. On the other side, the finitely many active elements are the same. Condition (b) in Definition 3 ensures that in a structural repair, data elements move from one category to another only as a result of changes made to the dimension schema (splitting or merging categories of S). 5 Figures 2 and 3 show two of the possible structural repairs for the inconsistent dimension of Figure 1. They show that strictness forces us to put elements England and UK in different categories. On the other hand, homogeneity forces us to either merge categories State and Province or isolate element LA in a new category. Proposition 1. Every inconsistent dimension has a structural repair.  This proposition can be proved by using a simple structural repair that essentially cre- ates a new one-element category for each element in the dimension instance, and also links between the newly created categories whenever their single elements are con- nected in the original instance. Figure 4 illustrates this kind of structural repair as a third repair for the instance in Figure 1. 5 Isomorphic structural repairs, i.e. that differ only in the names of active categories, will treated as being the same repair. Example 1. The following mapping is a schema mapping between the dimension of Figure 1 and the structural repair of Figure 2: g : City → {City1, City2}, State → {StateProv}, Province → {StateProv}, Country → {Country1, Country2}, All → {All}.  Fig. 4: Another SR for instance in Figure 1 Now we define minimal structural repairs (MSRs), as preferred repairs among the struc- tural repairs. This requires comparing structural repairs. The data movement set, defined next, has useful properties for enabling this comparison (cf. Section 4). Definition 4. Let S  , g be an SR for dimension S, and c a category of S. (a) The data movement set of category c is defined by:   U }, DMSet g (c) = {e | (e, c) ∈ EC U }  {e | (e, c ) ∈ EC c ∈g(c) where denotes the symmetric difference oftwo sets. (b) The overall data movement set between S and S  is DMSet g (S, S  ) := c∈C U DMSet g (c).  Example 2. (example 1 cont.) For the schema mapping in the example: DMSet g (City) = ∅, DMSet g (State) = {BC, ON}, DMSet g (Province) = {California}, DMSet g (Country) = ∅, DMSet g (All) = ∅.  Intuitively, an MSR is a new dimension that is obtained by applying a minimal set of changes to the schema of an inconsistent dimension. Inspired by the notion of priori- tized minimization [16], we propose to minimize both data movement and the changes in the set of categories, but assigning higher priority to minimizing the former. Definition 5. For a dimension S and two SRs S 1 , g1  and S2 , g2 : S1 , g1  ≤S S2 , g2  iff DMSet g1 (S, S1 ) ⊆ DMSet g2 (S, S2 ) and   DMSet g1 (S, S1 ) = DMSet g2 (S, S2 ) ⇒ (C S C S1 ) ⊆ (C S C S2 ).   Here, C S , C S1 and C S2 denote the finite sets (C U ) of active categories for structures S, S1 and S2 , respectively.  Definition 6. S1 , g1  is a minimal structural repair (MSR) of dimension S iff it is an SR of S and there is no other SR S 2 , g2  for S, such that S 2 , g2