Consistent Query Answering in Data Warehouses Leopoldo Bertossi1 , Loreto Bravo2 and Mónica Caniupán3 1 Carleton University, Canada 2 Universidad de Concepción, Chile 3 Universidad de Bio-Bio, Chile Abstract. A Data Warehouse (DW) is a data repository that organizes and phys- ically integrates data from multiple sources under special kinds of schemas. A DW is composed by a set of dimensions that reflect the way the data is struc- tured, and the facts that correspond to quantitative data related with the dimen- sions. A dimension schema is a hierarchical graph of categories. A dimension instance is strict if every element of the dimension has a unique ancestor element in each of the ancestor categories. This property is crucial for the efficiency of the system since it allows for the correct computation of aggregate queries using pre-computed views. A dimension instance may become non-strict after update operations. When this happens, the instance can be minimally repaired in several ways. In this paper we characterize consistent answers to aggregate queries by means of smallest ranges that contain the answers obtained from every minimal repair. We also introduce the notion of canonical dimension which captures in- formation about all the minimal repairs. We use this dimension to approximate consistent query answers. 1 Introduction Data Warehouses (DWs), or more generally, multidimensional databases, are data repos- itories that integrate data from different sources, and keep historical data for analysis and decision support [7]. DWs represent data according to dimensions and facts. The former reflect the perspectives from which data are viewed, and we may have several of them. The latter corresponds to data (also known as measures) which are generally quantitative and are associated to the different dimensions. Facts can be aggregated, filtered and referenced using the dimensions. As an illus- tration, the facts related to the sales of a company may be associated to the dimensions time and location, and should be understood as the sales at certain locations in certain periods of time. A dimension schema is usually a hierarchical lattice of category names. A dimension instance for the schema assigns sets or extensions to the category names, and also imposes a lattice like structure between elements of different categories. As an example, the dimension time could be represented by the schema: date→month →year. The multidimensional structure of a DW allows users to formulate aggregate queries at different levels of granularity. Example 1. A company that manages an online Chilean phone call repository created a Phone Traffic DW, with dimensions Time and Phone with the schema shown in Figure 1(a). In the Time dimension, each Date is associated to a Month and each month is as- sociated to a Year which is associated to a category All. On the other hand, in the Phone dimension, each Number is associated to an AreaCode and to a City. Both AreaCode and City are connected to a Region. The top category is All. All All All all Calls Region Number Date In Out Year Region IX VIII N1 Jan 1,07 3 0 N2 Jan 1,07 2 1 AreaCode City N3 Jan 1,07 5 5 Month AreaCode City 45 41 TCH TEM CCP N1 Jan 2,07 8 0 Number N2 Jan 2,07 0 3 Date Number N1 N2 N3 N3 Jan 2,07 5 1 (a) Dimensions schemas (b) Phone dimension instance (c) Fact table Fig. 1. Phone Traffic DW Figure 1(b) shows a dimension instance for the Phone schema. In this instance, TCH (Talcahuano), TEM (Temuco) and CCP (Concepcion) are elements of category City, and IX and VIII are elements of Region. The facts stored in the Phone Traffic DW correspond to number of incoming and outgoing calls of a phone number at a given date, e.g. number N1 received three calls and made no calls on January 1, 2007. The fact table is shown in Figure 1(c). With this DW we can easily answers queries such as: How many outgoing calls were there on May 2007 per city? How many outgoing calls are there per region every year? 2 Generally, a dimension (instance) is required to be strict, this is, every element of a category should reach no more that one element in each ancestor category [17, 13]. For example, in the Phone Traffic DW, we expect this property to hold since each number should be associated with a unique city, region and area code. If a dimension instance satisfies its strictness constraint, we say that it is consistent. If dimensions are stored as relational tables, strictness can be imposed by a set of functional dependencies over the tables, as the following example shows. Example 2. The dimension instance in Figure 1 (b) is strict, since, as expected, every number rolls-up to a unique area code, city and region. On the other hand, dimension in Figure 2 is not strict since now element N3 rolls-up to both IX and VIII in category Region which shows that the data is not accurate and also implies that pre-computed answers at the level of AreaCode and City cannot be used to compute answers for Re- gion. The relational table in Figure 2 maps the edges between the elements of categories Number and Region. Since the functional dependency of Region upon Number does not hold in that table, the dimension instance is not strict. 2 Dimensions that are strict and homogenous (cf. Section 2) allow for the correct use of pre-computed answers at low level categories to compute aggregate queries at higher levels. Thus, query answering over strict dimensions increases efficiency of DWs [22]. Even though strictness is important, DWs do not enforce it, and a dimension might become non-strict after an update performed to adapt to changes in data sources or modifications to the business rules [15, 14, 20]. In an enterprise DW, with possibly ter- abytes of data [7], ensuring strictness of dimensions may be vital for efficient query answering and keeping the data clean. In [6] the concept of minimal repair is formalized, as a strict dimension instance that minimally differs from the given one by a minimum number of changes. Also logic programs to specify and compute them are provided. In that work, the focus is on how all IX VIII RRegion Number N1 VIII N2 IX 45 41 TCH TEM CCP N3 IX N3 VIII N1 N2 N3 Fig. 2. Non-strict dimension instance (category names are omitted) to aid the developer or administrator to restore consistency by finding a single repair. However, there might be several alternative repairs, and it is not always possible to know which one is the desirable one. Here we focus on answering aggregate queries that involve inconsistent dimension instances. In order to obtain semantically meaningful answers, we use the class of all minimal repairs to provide a minimal numerical range to which the answer to the query should belong. With this kind of answer we capture information that is shared by, or invariant under, the minimal repairs of the original instance. This is a form of consistent query answering (CQA) [3], a problem that has been investigated in the relational setting (cf. [5, 8] for recent surveys). In this paper, a consis- tent answer to an aggregate query captures through an interval the answers to the same query that would be obtained from each of the minimal repairs if they were material- ized. We also introduce the notion of a canonical dimension instance for the original, possibly inconsistent, dimension instance, and we show how to obtain it. This dimen- sion captures information about all the minimal repairs, and can be used to approximate the consistent query answers. The rest of the paper is organized as follows: Section 2 presents the multidimen- sional model. Next, Section 3 defines repairs of dimension instances and consistent query answer to aggregate queries. Section 4 presents the canonical dimension instance. Related work and some conclusions are discussed in Section 5. This paper presents some initial and ongoing research on dealing with inconsistent DW dimensions. 2 The Multidimensional Model In this section we present the multidimensional database model that we use as the ba- sic framework for our research. It is described in detail in [13]. A dimension schema S consists of a pair (C, %), where C is a set of categories, and % is a child/parent relation between categories. The dimension schema can be also represented with a di- rected acyclic graph where the vertices correspond to the categories and the edges to the child/parent relation. The transitive and reflexive closure of % is denoted by % ∗ . There are no shortcuts in the schemas, this is, if Ci % Cj there is no category Ck such that Ci %∗ Ck and Ck %∗ Cj . Every dimension schema contains a distinguished top category called All which is reachable from all other categories, i.e. for every C ∈ C, C%∗ All. The leaf categories are called bottom categories. To simplify the presentation and without loss of generality, we assume categories do not have attributes, and schemas have a unique bottom category. Example 3. The Phone dimension schema S = (C, %) of Figure 1(a) is defined by: C={Number, AreaCode, City, Region, All}, %= {(Number, AreaCode), (Number, City), (AreaCode, Region), (City, Region), (Region, All)}. The relation %∗ is % ∪ {(Number, Number), (Number, Region), (Number, All), . . . }. The bottom category is Number, and its ancestors are AreaCode, City, Region, and All. 2 A dimension (instance) D over a dimension schema S = (C, %) is a tuple (M, <), such that: (i) M is a finite collection of ground atoms of the form C(a) where C ∈ C and a is a constant. If C(a) ∈ M, a is said to be an element of C. The constant all is the only element of category All. Categories are assumed to be disjoint, i.e. if C i (a), Cj (a) ∈ M then i = j. There is a function δ that maps elements to categories so that δ(a) = C i iff Ci (a) ∈ M. (ii) The relation < contains the child/parent relationships between elements of different categories, and is compatible with %: If a < b, then δ(a) % δ(b). We denote with <∗ the reflexive and transitive closure of <. C The roll-up relation between categories Ci and Cj , denoted RCji (D) consists of the set of pairs {(a, b) | Ci (a), Cj (b) ∈ M and a < b}. When the dimension is clear from ∗ C the context, we denote the roll-up relation simply by RCji . A dimension instance D is said to be homogeneous [13] if, for every pair of cate- gories Ci % Cj and element a in Ci , there is an element b in Cj such that a < b. In what follows we restrict ourselves to homogeneous dimensions. Moreover, a dimension D = (M, <) is strict if, for every different elements a, b, c with a <∗ b and a <∗ c, it holds δ(b) 6= δ(c) [12]. In other words, strictness ensures that every relation < ∗ be- tween two categories is functional. In the paper, we will say that a dimension instance is inconsistent if it is not strict. Example 4. The dimension instance in Figure 1(b) is defined over the Phone dimension schema in Figure 1(a), and consists of: M = {Number(N1 ), Number(N2 ), Number(N3 ), AreaCode(45), AreaCode(41), City(TCH), City(TEM), City(CCP), Region(IX), Region(VIII), All(all)}, < = {(N1 ,41), (N2 ,45), (N3 ,41), (N1 ,TCH), (N2 ,TEM), (N3 ,CCP), (45,IX), (41,VIII), (TCH , VIII), (TEM , IX), (CCP, VIII), (IX,all), (VIII,all)}. Relation <∗ contains all the elements in < plus others, such as (N1 ,N1 ) and (N1 ,VIII). The dimension instance in Figure 1(b) is both homogeneous and strict. In compari- son, the dimension instance in Figure 2 is homogeneous, but not strict since N 3 rolls-up to both IX and VIII in category Region. 2 A dimension instance is sumarizable if it allows to compute answers to aggregate queries using other pre-computed queries. As an illustration, consider the DW in Figure 1, the query that request the number of calls grouped by Region, can be computed by using pre-computed answers at category AreaCode or City, if any. This will be more efficient since the pre-computed tables will be smaller than the fact tables. A dimension is summarizable if it is both homogeneous and strict [17]. 4 Due to our homogeneity assumption, to ensure summarizability we only need strictness. For instance, the dimension instance in Figure 1(b) is summarizable since it is homogeneous and strict. In contrast, the dimension instance in Figure 2 is not summarizable since it is not strict. If a DW is not summarizable, it will either return incorrect answers if 4 A requirement for summarizability which is not related to the dimension but to the query is that only distributive aggregate functions should be used (e.g. MAX, MIN, SUM, and COUNT). using pre-computed views, or it will lose efficiency by needing to compute the answers starting from the bottom category. 3 Repairs and Consistent Query Answering In this section we first introduce the concept of minimal repair [6], and next, we define consistent query answers using minimal repairs as a basis. Intuitively, a minimal repair is a new instance that is strict and is obtained by a minimum number of changes to the original roll-up relation. To compare different repairs, we use the distance between the given inconsistent instance and its repairs. Let D =(M, 1, the roll-up relation between a and category C is involved in an inconsistency. On the other hand, D = (M, <) defined over S = (C, %) is consistent iff for every a ∈ M and C ∈ C, it holds that RD (a, C) ≤ 1. all all IX VIII IX VIII {VIII,IX} 45 41 TCH TEM CCP 45 41 {45,41} TCH TEM CCP {TEM,CCP} N1 N2 N3 N1 N2 N3 (a) Core dimension (b) Canonical dimension Fig. 4. Core and canonical dimensions for the dimension in Figure 2 Definition 4. Given a dimension instance D = (M,