Logic Programs for Repairing Inconsistent Dimensions in Data Warehouses Loreto Bravo1 , Mónica Caniupán2 , and Carlos Hurtado3 1 Universidad de Concepción, Chile 2 Universidad de Bio-Bio, Chile 3 Universidad Adolfo Ibáñez Abstract. A Data Warehouse (DW) is a data repository that integrates data from multiple sources and organizes the data according to a set of data structures called dimensions. Each dimension provides a perspec- tive upon which the data can be viewed. In order to support an efficient processing of queries, a dimension is usually required to satisfy different classes of integrity constraints. In this paper, we study the problem of re- pairing a dimension when it fails to satisfy a set of two classes of integrity constraints: strictness constraints and covering constraints. We introduce the notion of minimal repair of a dimension in this context. A minimal repair is defined as a new dimension that is consistent with respect to the integrity constraints, which is obtained by applying a minimal amount of updates to the original dimension. We study the complexity of computing minimal repairs. Finally, we show how to characterize and compute min- imal repairs of a dimension using Datalog programs with stable model semantics. 1 Introduction Data Warehouses (DWs) are data repositories that integrate data from different sources, and keep historical data for analysis and decision support [10]. When generating reports, it is of central importance for these systems to compute queries that are summaries of data in a simple and efficient way. In order to do this, DWs organize data according to the multidimensional model. In the multidimensional model, dimensions reflect the perspectives upon which facts are viewed. Facts correspond to events which are usually associated to numeric values known as measures, and are referenced using the dimension elements. Dimensions are modeled as hierarchies whose nodes are called elements, where each element belongs to a category. The categories are also organized into a hierarchy called hierarchy schema. Example 1. Consider a company that manages an online repository of research articles. The company maintains a DW to generate summaries used to analyze the download behavior of its users. The DW contains the dimensions Time and Publication. The Time dimension is structured using a hierarchy schema with a bottom category Date, which goes to the category Month, which in turn goes to the category Year. On the other hand, the Publication dimension is structured using the hierarchy schema shown in Figure 1(a), where the Article category goes to Journal, which in turn goes to Area. Also the Article category is connected All all Downloads Area Subject c1 c2 d1 d2 Article Date N a1 01-01-2007 3 Journal b1 b2 a2 01-01-2007 2 a1 02-01-2007 5 Article a1 a2 a2 02-01-2007 5 (a) Publication dimension: (b) Publication dimension: ele- (c) Fact table hierarchy schema ments and rollup relations Fig. 1. Dimension for Example 1 to Subject category. The top category is All. Figure 1(b) shows the elements of the Publication dimension, along with the relations between elements of different categories, called rollup relations. For example, a1 and a2 are elements of category Article and b1 , b2 are elements of category Journal. For each edge in the hierarchy schema, there is a rollup relation. For example, the rollup relation from the category Article to the category Journal contains the pairs or elements: (a1 , b1 ) and (a2 , b2 ). The fact table of the DW is shown in Figure 1(c). Each fact stored in the table represents the numbers of times an article was downloaded in a given date. As an example, article a1 was downloaded three times on January 1, 2007. The hierarchical structure of dimensions allows users to access facts at different levels of granularity. As an example, using the aforementioned DW it is easy to compute summaries such as: number of times that each article was downloaded per month, or number of downloads broken down by area and year. 2 In order to compute summaries efficiently, DWs use pre-computed summaries at low level categories to derive summaries at higher level categories. Two main classes of integrity constraints, strictness and covering constraints, are used to check whether such computations, called summarizations, are correct [13, 19, 25]. In a consistent summarization, each fact is aggregated once and not more than once. Strictness constraints are used to require rollup relations to be strict (many- to-one) relations. If a rollup relation is required to be strict, then there cannot exist an element connected to two different elements in the rollup relation. As an example, in the Publication dimension (Figure 1(a) and (b)) the rollup relation from Article to Journal is strict. In contrast, the rollup relation from Article to Subject is not strict, because a2 is connected to two different elements in the category Subject. Covering constraints are used to require a rollup relation to connect all the elements that belongs to the category from which the rollup relation departs. As an example, in the dimension of Figure 1(b), the rollup relation from Article to Subject it is not covering since the element a1 is not connected to any element in Subject. Example 2. The summary SArea (Area, N ) = {(c1 , 8), (c2 , 7)} (Figure 1) can be correctly derived from the summary SArticle (Article, N ) = {(a1 , 8), (a2 , 7)} by sum- ming up the number of downloads for each group of articles that go to the same area in the dimension (element a1 goes to c1 , and a2 goes to c2 ). The cor- rectness of this derivation follows from the fact that the rollup relation from All all all all Area Subject c1 c2 d1 c1 c2 d1 c1 c2 d1 Journal b1 b2 b1 b2 b1 b2 Article a1 a2 a1 a2 a1 a2 (a) Dimension D (b) Repair D1 (c) Repair D2 Fig. 2. A Publication dimension inconsistent with respect to Article→Area and Article⇒All Article to Area is strict and covering. Similarly, SArea can be correctly derived from SJournal (Journal, N ) = {(b1 , 8), (b2 , 7)}. However, the summary SAll (All, N ) = {(all, 15)} cannot be correctly derived from SSubject (Subject, N ) = {(d1 , 0), (d2 , 7)} since such derivation would yield SAll (All, N ) = {(all, 7)}. This derivation is incor- rect because the rollup relation from Article to Subject is not covering. 2 It has been shown that the dimensions that arise in real-world applications do not always have strict and covering rollup relations. The flexibility to represent rollup relations has been incorporated as a central feature of several dimension models [13, 24]. In order to keep the ability to check consistency of summarizations, it is important for DW systems to know the set of strictness and covering constraints that hold. This can be achieved by allowing DW designers to formulate the constraints that must hold when the dimensions are modeled. In this paper, we study the form of inconsistency that arises when a dimension does not satisfy a set of strictness and covering constraints. Similar to the vast majority of the work on inconsistency handling in databases (cf. [4]), we address the scenario where the constraints prevail over the data and the dimension must be updated in order for it to satisfy the constraints. Example 3. Suppose that the DW administrator considers that the following constraints should be satisfied by the dimension D in Figure 2: ϕ1 : Article ⇒ All, ϕ2 : Journal ⇒ All, ϕ3 : Area ⇒ All, ϕ4 : Subject ⇒ All, which state that the rollup relations from categories Article, Journal, Area and Subject to All are covering, and the constraint ϕ5 : Article → Area establishing that the rollup relation from Article to Area is strict. However, dimension D fail to satisfy constraint ϕ5 , since element a1 goes to two different elements: c1 and c2 in the Area category. Also, D violates ϕ1 because element a2 does not reach all. The DW administrator considers that the constraints properly capture the semantics of the data, and therefore their violation indicates the presence of errors. Therefore, the dimension should be repaired (corrected) in order to resolve the inconsistency. Figure 2(b)-(c) are two possible repairs of D. Repair D1 is obtained by deleting edge (a1 , b2 ) to remove the violation of ϕ5 and inserting (a2 , d1 ) to remove the violation of ϕ1 . Repair D2 is obtained by deleting edge (a1 , b1 ) and inserting (a2 , d1 ). 2 We introduce the notion of a minimal repair of a dimension that does not satisfy a set of constraints. A minimal repair is defined as a new dimension that satisfies the constraints and is obtained by applying a minimal number of changes to the original dimension. We study the complexity of the problem of computing min- imal repairs, and show that in general the problem is NP-hard. We also explore the use of Datalog programs with stable model semantics [11] to characterize minimal repairs. Logic programs act as a compact and executable logical specifi- cation of dimension repairs. The rest of the paper is organized as follows: Section 2 presents DW dimensions, and strictness and covering constraints. Next, Sec- tion 3 presents the notion of repair and describes complexity issues. Section 4 presents the logic programs to compute repairs. Finally, Sections 5 and 6 present related work and the conclusions of the paper. 2 Preliminaries In this section we formalize dimensions using standard concepts from multidi- mensional models obtained from [9, 15, 17, 24], with some minor modifications to facilitate presentation. A hierarchy schema H consists of a pair (CH , րH ), where (CH , րH ) is an acyclic directed graph. Vertices in the set CH are categories and the edges րH represent the child/parent relations between categories. The transitive and re- flexive closure of րH is denoted by ր∗H . The set of categories CH contains a distinguished top category denoted AllH , which is reachable from every other category in CH and has no outgoing edges, that is, there is no category ci ∈ CH such that (AllH , ci ) ∈րH and for every cj ∈ C, (cj , AllH ) ∈ր∗H . Sometimes, we will write ca րH cb instead of (ca , cb ) ∈րH . Example 4. The hierarchy schema H = (CH , րH ), depicted in Figure 1(a), is as follows: CH = {Article, Journal, Subject, Area, All}; AllH = All; and րH ={ (Article, Journal), (Journal, Area), (Area, All), (Article, Subject), (Subject, All)}. 2 A dimension D is a tuple (HD , ED , CatD ,