=Paper=
{{Paper
|id=None
|storemode=property
|title=Logic Programs for Repairing Inconsistent Dimensions in Data Warehouses
|pdfUrl=https://ceur-ws.org/Vol-619/paper9.pdf
|volume=Vol-619
|dblpUrl=https://dblp.org/rec/conf/amw/BravoMH10
}}
==Logic Programs for Repairing Inconsistent Dimensions in Data Warehouses==
Logic Programs for Repairing Inconsistent
Dimensions in Data Warehouses
Loreto Bravo1 , Mónica Caniupán2 , and Carlos Hurtado3
1
Universidad de Concepción, Chile
2
Universidad de Bio-Bio, Chile
3
Universidad Adolfo Ibáñez
Abstract. A Data Warehouse (DW) is a data repository that integrates
data from multiple sources and organizes the data according to a set of
data structures called dimensions. Each dimension provides a perspec-
tive upon which the data can be viewed. In order to support an efficient
processing of queries, a dimension is usually required to satisfy different
classes of integrity constraints. In this paper, we study the problem of re-
pairing a dimension when it fails to satisfy a set of two classes of integrity
constraints: strictness constraints and covering constraints. We introduce
the notion of minimal repair of a dimension in this context. A minimal
repair is defined as a new dimension that is consistent with respect to the
integrity constraints, which is obtained by applying a minimal amount of
updates to the original dimension. We study the complexity of computing
minimal repairs. Finally, we show how to characterize and compute min-
imal repairs of a dimension using Datalog programs with stable model
semantics.
1 Introduction
Data Warehouses (DWs) are data repositories that integrate data from different
sources, and keep historical data for analysis and decision support [10]. When
generating reports, it is of central importance for these systems to compute
queries that are summaries of data in a simple and efficient way. In order to
do this, DWs organize data according to the multidimensional model. In the
multidimensional model, dimensions reflect the perspectives upon which facts
are viewed. Facts correspond to events which are usually associated to numeric
values known as measures, and are referenced using the dimension elements.
Dimensions are modeled as hierarchies whose nodes are called elements, where
each element belongs to a category. The categories are also organized into a
hierarchy called hierarchy schema.
Example 1. Consider a company that manages an online repository of research
articles. The company maintains a DW to generate summaries used to analyze
the download behavior of its users. The DW contains the dimensions Time and
Publication. The Time dimension is structured using a hierarchy schema with a
bottom category Date, which goes to the category Month, which in turn goes
to the category Year. On the other hand, the Publication dimension is structured
using the hierarchy schema shown in Figure 1(a), where the Article category goes
to Journal, which in turn goes to Area. Also the Article category is connected
All all
Downloads
Area Subject c1 c2 d1 d2
Article Date N
a1 01-01-2007 3
Journal b1 b2
a2 01-01-2007 2
a1 02-01-2007 5
Article a1 a2 a2 02-01-2007 5
(a) Publication dimension: (b) Publication dimension: ele- (c) Fact table
hierarchy schema ments and rollup relations
Fig. 1. Dimension for Example 1
to Subject category. The top category is All. Figure 1(b) shows the elements of
the Publication dimension, along with the relations between elements of different
categories, called rollup relations. For example, a1 and a2 are elements of category
Article and b1 , b2 are elements of category Journal. For each edge in the hierarchy
schema, there is a rollup relation. For example, the rollup relation from the
category Article to the category Journal contains the pairs or elements: (a1 , b1 )
and (a2 , b2 ). The fact table of the DW is shown in Figure 1(c). Each fact stored
in the table represents the numbers of times an article was downloaded in a given
date. As an example, article a1 was downloaded three times on January 1, 2007.
The hierarchical structure of dimensions allows users to access facts at different
levels of granularity. As an example, using the aforementioned DW it is easy to
compute summaries such as: number of times that each article was downloaded
per month, or number of downloads broken down by area and year. 2
In order to compute summaries efficiently, DWs use pre-computed summaries
at low level categories to derive summaries at higher level categories. Two main
classes of integrity constraints, strictness and covering constraints, are used to
check whether such computations, called summarizations, are correct [13, 19, 25].
In a consistent summarization, each fact is aggregated once and not more than
once. Strictness constraints are used to require rollup relations to be strict (many-
to-one) relations. If a rollup relation is required to be strict, then there cannot
exist an element connected to two different elements in the rollup relation. As an
example, in the Publication dimension (Figure 1(a) and (b)) the rollup relation
from Article to Journal is strict. In contrast, the rollup relation from Article to
Subject is not strict, because a2 is connected to two different elements in the
category Subject. Covering constraints are used to require a rollup relation to
connect all the elements that belongs to the category from which the rollup
relation departs. As an example, in the dimension of Figure 1(b), the rollup
relation from Article to Subject it is not covering since the element a1 is not
connected to any element in Subject.
Example 2. The summary SArea (Area, N ) = {(c1 , 8), (c2 , 7)} (Figure 1) can be
correctly derived from the summary SArticle (Article, N ) = {(a1 , 8), (a2 , 7)} by sum-
ming up the number of downloads for each group of articles that go to the
same area in the dimension (element a1 goes to c1 , and a2 goes to c2 ). The cor-
rectness of this derivation follows from the fact that the rollup relation from
All all all all
Area Subject c1 c2 d1 c1 c2 d1 c1 c2 d1
Journal b1 b2 b1 b2 b1 b2
Article a1 a2 a1 a2 a1 a2
(a) Dimension D (b) Repair D1 (c) Repair D2
Fig. 2. A Publication dimension inconsistent with respect to Article→Area and Article⇒All
Article to Area is strict and covering. Similarly, SArea can be correctly derived
from SJournal (Journal, N ) = {(b1 , 8), (b2 , 7)}. However, the summary SAll (All, N ) =
{(all, 15)} cannot be correctly derived from SSubject (Subject, N ) = {(d1 , 0), (d2 , 7)}
since such derivation would yield SAll (All, N ) = {(all, 7)}. This derivation is incor-
rect because the rollup relation from Article to Subject is not covering. 2
It has been shown that the dimensions that arise in real-world applications do not
always have strict and covering rollup relations. The flexibility to represent rollup
relations has been incorporated as a central feature of several dimension models
[13, 24]. In order to keep the ability to check consistency of summarizations, it is
important for DW systems to know the set of strictness and covering constraints
that hold. This can be achieved by allowing DW designers to formulate the
constraints that must hold when the dimensions are modeled. In this paper, we
study the form of inconsistency that arises when a dimension does not satisfy
a set of strictness and covering constraints. Similar to the vast majority of the
work on inconsistency handling in databases (cf. [4]), we address the scenario
where the constraints prevail over the data and the dimension must be updated
in order for it to satisfy the constraints.
Example 3. Suppose that the DW administrator considers that the following
constraints should be satisfied by the dimension D in Figure 2: ϕ1 : Article ⇒ All,
ϕ2 : Journal ⇒ All, ϕ3 : Area ⇒ All, ϕ4 : Subject ⇒ All, which state that the rollup
relations from categories Article, Journal, Area and Subject to All are covering, and
the constraint ϕ5 : Article → Area establishing that the rollup relation from Article
to Area is strict. However, dimension D fail to satisfy constraint ϕ5 , since element
a1 goes to two different elements: c1 and c2 in the Area category. Also, D violates
ϕ1 because element a2 does not reach all. The DW administrator considers that
the constraints properly capture the semantics of the data, and therefore their
violation indicates the presence of errors. Therefore, the dimension should be
repaired (corrected) in order to resolve the inconsistency. Figure 2(b)-(c) are
two possible repairs of D. Repair D1 is obtained by deleting edge (a1 , b2 ) to
remove the violation of ϕ5 and inserting (a2 , d1 ) to remove the violation of ϕ1 .
Repair D2 is obtained by deleting edge (a1 , b1 ) and inserting (a2 , d1 ). 2
We introduce the notion of a minimal repair of a dimension that does not satisfy
a set of constraints. A minimal repair is defined as a new dimension that satisfies
the constraints and is obtained by applying a minimal number of changes to the
original dimension. We study the complexity of the problem of computing min-
imal repairs, and show that in general the problem is NP-hard. We also explore
the use of Datalog programs with stable model semantics [11] to characterize
minimal repairs. Logic programs act as a compact and executable logical specifi-
cation of dimension repairs. The rest of the paper is organized as follows: Section
2 presents DW dimensions, and strictness and covering constraints. Next, Sec-
tion 3 presents the notion of repair and describes complexity issues. Section 4
presents the logic programs to compute repairs. Finally, Sections 5 and 6 present
related work and the conclusions of the paper.
2 Preliminaries
In this section we formalize dimensions using standard concepts from multidi-
mensional models obtained from [9, 15, 17, 24], with some minor modifications to
facilitate presentation.
A hierarchy schema H consists of a pair (CH , րH ), where (CH , րH ) is an
acyclic directed graph. Vertices in the set CH are categories and the edges րH
represent the child/parent relations between categories. The transitive and re-
flexive closure of րH is denoted by ր∗H . The set of categories CH contains a
distinguished top category denoted AllH , which is reachable from every other
category in CH and has no outgoing edges, that is, there is no category ci ∈ CH
such that (AllH , ci ) ∈րH and for every cj ∈ C, (cj , AllH ) ∈ր∗H . Sometimes, we
will write ca րH cb instead of (ca , cb ) ∈րH .
Example 4. The hierarchy schema H = (CH , րH ), depicted in Figure 1(a), is as
follows: CH = {Article, Journal, Subject, Area, All}; AllH = All; and րH ={ (Article,
Journal), (Journal, Area), (Area, All), (Article, Subject), (Subject, All)}. 2
A dimension D is a tuple (HD , ED , CatD ,