=Paper=
{{Paper
|id=Vol-28/paper-7
|storemode=property
|title=Automated dimensionality reduction of data warehouses
|pdfUrl=https://ceur-ws.org/Vol-28/paper7.pdf
|volume=Vol-28
|dblpUrl=https://dblp.org/rec/conf/dmdw/LastM00
}}
==Automated dimensionality reduction of data warehouses==
Mark Last
Department of Computer Science and Engineering
University of South Florida
4202 E. Fowler Avenue, ENB 118, Tampa, FL 33620, USA
mlast@csee.usf.edu

Oded Maimon
Department of Industrial Engineering
Tel-Aviv University, Tel Aviv 69978, Israel
maimon@eng.tau.ac.il
Abstract

A data warehouse is designed to consolidate and maintain all attributes that are relevant for the analysis processes. Due to the rapid increase in the size of modern operational systems, it becomes neither practical nor necessary to load and maintain every operational attribute in the data warehouse. This paper presents a novel methodology for automated selection of the most relevant independent attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases. Attributes are selected by a stepwise forward procedure aimed at minimizing the uncertainty in the values of key performance indicators (KPI's). Each selected attribute is assigned a score expressing its degree of relevance. Using the method does not require any prior expertise in the domain of the data, and it can be equally applied to nominal and ordinal attributes. An attribute is included in a data warehouse schema if it is found relevant to at least one KPI. We demonstrate the applicability of the method by reducing the dimensionality of a direct marketing database.

1 Introduction

A data warehouse is defined by Inmon (1994) as a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making." Each organization has its own key performance indicators (KPI's), which are used by management to monitor the organization's performance. Thus, a manufacturing company may measure its performance by throughput and cost, a KPI of a service company is the mean time to handle a service call, etc. When designing a data warehouse, we assume that these KPI's depend on some non-key attributes (called classifying attributes) in data warehouse relations (Schouten, 1999). The form and the strength of these dependencies are often the subject of data analysis on a data warehouse.

Since the operational systems and the data warehouses are built for different purposes (see Inmon, 1994), some attributes that are essential to the operational system may be completely irrelevant to the performance measures of a company and thus excluded from its data warehouse. This fact is emphasized by several authors (like Gupta, 1997), but usually their assumption is that the data warehouse designers are knowledgeable enough to choose the right set of attributes. Though this assumption may be correct to a certain extent in most data warehousing projects, the process of reducing the warehouse dimensionality can be supported by an automated feature selection procedure that can quickly examine numerous potential dependencies, leading to automatic elimination of all irrelevant and redundant attributes.

There is another advantage in reducing the warehouse dimensionality. According to Elder and Pregibon (1996), a large number of attributes constitutes a serious obstacle to the efficiency of most data mining algorithms, which may be applied to the detail data in a data warehouse. Such popular methods as k-nearest neighbors, decision trees, and neural networks do not scale well in the presence of numerous features. Moreover, some algorithms may be confused by irrelevant or noisy attributes and construct poor prediction models. A successful choice of features provided to a data mining tool can increase its accuracy, save computation time, and simplify its results.

John et al. (1994) distinguish between two models of selecting a "good" set of features under some objective function. The feature filter model assumes filtering the features before applying a data mining algorithm, while the wrapper model uses the data mining algorithm itself to evaluate the features. The possible search strategies in the space of feature subsets include backward elimination and forward selection. The performance criterion of the wrapper model in (John et al., 1994) is the prediction accuracy of a data mining algorithm, estimated by n-fold cross-validation.
A new book on feature selection by Liu and Motoda (1998) suggests a unified model of the feature selection process. Their model includes four parts: feature generation, feature evaluation, stopping criteria, and testing. In addition to the "classic" evaluation measures (accuracy, information, distance, and dependence) that can be used for removing irrelevant features, they mention important consistency measures (e.g., the inconsistency rate), required to find a minimum set of relevant features. By decreasing the inconsistency rate of the data, both irrelevant and redundant features are removed. However, as indicated by Liu and Motoda (1998), consistency measures are only suitable for selecting discrete features.

An enhanced greedy algorithm, based on the wrapper model, is presented by Caruana and Freitag (1994). Again, the metric used is the generalization performance of the learning algorithm (its accuracy over the validation data set), which increases the computation time of the entire process.

An information-theoretic method for selecting relevant features is presented by Almuallim and Dietterich (1992). In their Mutual-Information-Greedy (MIG) algorithm, defined for Boolean noise-free features, a feature is selected if it leads to the minimum conditional entropy of the classification attribute. Since the data is assumed to be noise-free, no significance testing is required (any non-zero entropy is significant). These assumptions leave the MIG algorithm at quite a distance from most practical problems of reducing data warehouse dimensionality.

Kira and Rendell (1992) have suggested an efficient feature selection algorithm, called Relief, which evaluates each attribute by its ability to distinguish among instances that are near each other. Their selection criterion, the feature relevance, is applicable to numeric and nominal attributes. The greatest limitation of Relief is its inability to identify redundant features within a set of relevant features. Consequently, the set of features selected by Relief may not be optimal.

To sum up this section, the backward elimination strategy is very inefficient for reducing the dimensionality of real-world data warehouses, which may have hundreds and thousands of original attributes. On the other hand, the forward selection wrapper methods are highly expensive in terms of computational effort. The filter algorithms are computationally cheaper, but they usually fail to remove all redundant features. In the next section, we describe the information-theoretic method of feature selection in data warehouses, based on the knowledge discovery procedure introduced by us in (Maimon, Kandel, and Last, 1999). The procedure integrates feature selection with a highly scalable data mining algorithm, leading to the elimination of both irrelevant and redundant features. The method description is followed by a case study of feature selection in a direct marketing database. We conclude the paper by discussing potential enhancements of the proposed approach.

2 Information-Theoretic Method of Feature Selection

The method selects features by constructing information-theoretic connectionist networks, which represent interactions between the classifying features and each dependent attribute (a key performance indicator). The method is based on the extended relational data model, described in sub-section 2.1. In sub-section 2.2, we present the main steps of the feature selection procedure. Extraction of functional dependencies from the constructed networks is covered by sub-section 2.3. Finally, in sub-section 2.4, we evaluate the computational complexity of the algorithm.

2.1 Extended Relational Data Model

We use the following standard notation of the relational data model (see Korth and Silberschatz, 1991):

1) R - a relation schema including N attributes (N ≥ 2). A relation is a part of the operational database.
2) Ai - an attribute No. i. R = (A1, ..., AN).
3) Di - the domain of an attribute Ai. We assume that each domain is a finite set of Mi discrete values, ∀i: Mi ≥ 2. For numeric attributes, the domain is a set of adjacent intervals. The discretization of classifying continuous attributes is performed automatically in the process of feature selection (see next sub-section). Continuous KPI's may be discretized manually into pre-defined intervals.
4) Vij - a value No. j of domain Di. Consequently, Di = (Vi1, ..., ViMi). For discretized numeric attributes, each value represents an interval between two continuous values.
5) r - a relation instance (table) of the relation schema R.
6) n - number of tuples (records) in a relation r (n ≥ 2).
7) tk[Ai] - value of an attribute No. i in a tuple (record) No. k. ∀k,i: tk[Ai] ∈ Di.
To find the set of classifying attributes in a relation of the operational database, we make the following partition of the relation schema:

1) O - a subset of target (dependent) attributes (O ⊂ R, |O| ≥ 1). These attributes represent the key performance indicators (KPI's) of an organization. Such attributes are referred to as facts in the data warehouse logical model (Inmon, 1994). Each connectionist network is aimed at predicting the values of a KPI, based on the values of input attributes (see below).
2) C - a subset of candidate input attributes (C ⊂ R, |C| ≥ 1). This is a subset of attributes (features), which can be related to the target attributes.
3) Ii - a subset of input attributes (classifying features) selected by the algorithm as related to the target attribute i (∀i: Ii ⊂ C). These attributes are going to be loaded as dimensions into the data warehouse.

Assumptions:

1) ∀i: Ii ∩ O = ∅ (an attribute cannot be both an input and a target).
2) ∀i: Ii ∪ O ⊆ R (some attributes may be neither input nor target). For example, the attributes used as primary keys in the operational database (like a social security number) may be completely useless for a data warehouse, which has primary keys of its own. Sometimes (e.g., in health care) the identifying attributes are removed from the data warehouse to preserve the privacy of the historic records.
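To make the notation of this sub-section concrete, here is a minimal Python sketch of one possible container for the schema partition (R, O, C, and the per-target subsets Ii). The class and field names are our own illustrative choices, not part of the original method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class SchemaPartition:
    """Illustrative container for the extended relational data model of sub-section 2.1."""
    attributes: List[str]                    # R = (A1, ..., AN)
    targets: Set[str]                        # O - key performance indicators (facts)
    candidates: Set[str]                     # C - candidate input attributes
    selected: Dict[str, Set[str]] = field(default_factory=dict)   # Ii for every target attribute i

    def validate(self) -> None:
        universe = set(self.attributes)
        # Targets and candidates must come from the relation schema R.
        assert self.targets <= universe and self.candidates <= universe
        for inputs in self.selected.values():
            # Assumption 1: an attribute cannot be both an input and a target.
            assert not (inputs & self.targets)
            # Selected input attributes are drawn from the candidate subset C.
            assert inputs <= self.candidates
```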
2.2 Feature Selection Procedure

The main steps of the feature selection procedure are given below.

Step 1 - Obtain the relation schema (name, type, and domain size of each attribute) and the schema partition into a subset of candidate input attributes and a subset of target attributes (see the extended relational model in sub-section 2.1 above).

Step 2 - Read the relation tuples (records) from the operational database. Tuples with illegal or missing target values are ignored by the algorithm.

Step 2.1 - Encode missing values of candidate input attributes in a pre-determined form.

Step 2.2 - Discretize each continuous attribute by maximizing its mutual information with the target attribute. Mutual information (see Cover, 1991) is defined as a decrease in the uncertainty of one attribute, given the value of another attribute.

Step 3 - Enter the minimum significance level for splitting a network node (default = 0.1%). This significance level is used by the likelihood-ratio test (see Step 4.2.1.2.3 below).

Step 4 - Repeat for every target attribute (KPI) i:

Step 4.1 - Initialize the information-theoretic network (one hidden layer including the root node associated with all tuples, no input attributes, one target layer for the values of the target attribute). A new network is built for every target attribute. An example of the initial network structure for a three-valued target attribute is shown in Figure 1.

[Figure 1: Information-Theoretic Connectionist Network - Initial Structure]

Step 4.2 - Repeat for the maximum number of hidden layers (default = number of candidate input attributes):

Step 4.2.1 - Repeat for every candidate input attribute i' which is still not an input attribute:

Step 4.2.1.1 - Initialize to zero the degrees of freedom, the estimated conditional mutual information, and the likelihood-ratio statistic of the candidate input attribute and the target attribute, given the final hidden layer of nodes. Conditional mutual information (Cover, 1991) measures the net decrease in the entropy (uncertainty) of the dependent attribute due to adding information about each new classifying attribute. If the target is completely independent of an attribute, the conditional mutual information, given that attribute, is zero.

Step 4.2.1.2 - Repeat for every node of the final hidden layer:

Step 4.2.1.2.1 - Calculate the estimated conditional mutual information of the candidate input attribute and the target attribute, given the node.

Step 4.2.1.2.2 - Calculate the likelihood-ratio statistic of the candidate input attribute and the target attribute, given the node.

Step 4.2.1.2.3 - If the likelihood-ratio statistic is significant, mark the node as "splitted" and increment the conditional mutual information of the candidate input attribute and the target attribute, given the final hidden layer of nodes; else mark the node as "unsplitted".

Step 4.2.1.2.4 - Go to the next node.

Step 4.2.1.3 - Go to the next candidate input attribute.

Step 4.2.2 - Find the candidate input attribute maximizing the estimated conditional mutual information. According to information theory, this attribute is the best predictor of the target, given the values of the other input attributes.

Step 4.2.3 - If the maximum estimated conditional mutual information is greater than zero:

• Make the best candidate attribute an input attribute.
• Define a new layer of hidden nodes for the Cartesian product of the splitted hidden nodes of the previous layer and the values of the best candidate attribute.
• Record the tuples associated with every node of the new layer (a new node is defined if there is at least one tuple associated with it).

Else stop the search and output the subset Ii of input attributes (classifying features) associated with the target attribute i.

Step 4.3 - Go to the next target attribute.

Step 5 - Define the set I of selected attributes (dimensions) as the union of the sets of input attributes with respect to every target attribute:

$I = \bigcup_{i=1}^{|O|} I_i$

Step 6 - End.
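The following Python code is a simplified sketch of the search loop of Steps 4.1-4.2.3 for a single target attribute, given under our own assumptions: tuples are represented as Python dictionaries keyed by attribute name, the conditional-MI and likelihood-ratio estimators below are standard contingency-table formulas that may differ in detail from those of (Maimon et al., 1999), and the maximum-number-of-layers limit of Step 4.2 is left implicit.

```python
import math
from collections import Counter
from scipy.stats import chi2


def _counts(rows, attr, target):
    """Joint and marginal value counts of a candidate attribute and the target within one node."""
    joint = Counter((t[attr], t[target]) for t in rows)
    a_marg = Counter(t[attr] for t in rows)
    t_marg = Counter(t[target] for t in rows)
    return joint, a_marg, t_marg


def node_conditional_mi(rows, attr, target, n_total):
    """Contribution of one hidden node to the estimated conditional MI (in bits)."""
    joint, a_marg, t_marg = _counts(rows, attr, target)
    n_z = len(rows)
    return sum((c / n_total) * math.log2(c * n_z / (a_marg[a] * t_marg[v]))
               for (a, v), c in joint.items())


def node_likelihood_ratio(rows, attr, target):
    """G^2 statistic and degrees of freedom for independence of attr and target within a node."""
    joint, a_marg, t_marg = _counts(rows, attr, target)
    n_z = len(rows)
    g2 = 2.0 * sum(c * math.log(c * n_z / (a_marg[a] * t_marg[v])) for (a, v), c in joint.items())
    return g2, max((len(a_marg) - 1) * (len(t_marg) - 1), 1)


def select_features(tuples, candidates, target, alpha=0.001):
    """Forward selection of input attributes for a single KPI (a sketch of Steps 4.1-4.2.3)."""
    n_total = len(tuples)
    inputs = []                          # Ii - input attributes selected so far
    layer = {(): list(tuples)}           # Step 4.1: one hidden layer with the root node only
    while True:
        remaining = [a for a in candidates if a not in inputs]
        best = None                      # (attribute, conditional MI, splitted node keys)
        for attr in remaining:                                       # Step 4.2.1
            cmi, splitted = 0.0, []
            for node, rows in layer.items():                         # Step 4.2.1.2
                stat, dof = node_likelihood_ratio(rows, attr, target)
                if stat > chi2.ppf(1.0 - alpha, dof):                # significant -> node "splitted"
                    cmi += node_conditional_mi(rows, attr, target, n_total)
                    splitted.append(node)
            if best is None or cmi > best[1]:
                best = (attr, cmi, splitted)
        if best is None or best[1] <= 0.0:                           # Step 4.2.3, else-branch: stop
            break
        attr, _, splitted = best
        inputs.append(attr)                                          # Step 4.2.3, first bullet
        # New layer: Cartesian product of splitted nodes and values of the best attribute;
        # a node is defined only if at least one tuple is associated with it.
        layer = {node + (value,): [t for t in layer[node] if t[attr] == value]
                 for node in splitted
                 for value in {t[attr] for t in layer[node]}}
    return inputs
```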
In Figure 2, the structure of a two-layered network (based on two selected input attributes) is shown. The first input attribute has three values, represented by nodes no. 1, 2, and 3 in the first layer, but only nodes no. 1 and 3 are splitted, due to the statistical significance testing in Step 4.2.1.2 above. The second layer has four nodes, standing for the combinations of two values of the second input attribute with the two splitted nodes of the first layer. As in Figure 1, the target attribute has three values, represented by three nodes in the target layer.

Details of the network construction procedure for a single target attribute are provided in (Maimon et al., 1999). The application of the algorithm to the design of data warehouses is presented here for the first time.

[Figure 2: Information-Theoretic Connectionist Network - Two-Layered Structure]

2.3 Extracting Functional Dependencies

Each connection between a terminal (unsplitted / final layer) node and a target node in the information-theoretic network represents an association rule between a conjunction of input attribute-values and a target value. Due to the inherent noisiness of real-world data, this rule can be considered a weak (probabilistic) functional dependency, as opposed to the deterministic functional dependencies of primary and alternate keys (Schouten, 1999). In the relational model (see Korth and Silberschatz, 1991), a functional dependency between sets of attributes X and Y means that the values of the Y component of a tuple are determined by the values of the X component. In contrast, an association rule between X and Y has a more limited meaning: given values of X, we can estimate the probability distribution of Y.

An information-theoretic weight of each rule is given by:

$w_{zij} = P(V_{ij}; z) \cdot \log \frac{P(V_{ij} / z)}{P(V_{ij})}$

where:
P(Vij; z) - an estimated joint probability of the target value Vij and the node z.
P(Vij / z) - an estimated conditional (a posteriori) probability of the target value Vij, given the node z.
P(Vij) - an estimated unconditional (a priori) probability of the target value Vij.
The connection weights express the mutual information between hidden and target nodes. A connection weight is positive if the conditional probability of a target attribute value, given the node, is higher than its unconditional probability, and negative otherwise. A weight close to zero means that the target attribute value is almost independent of the node value. This means that each rule having a positive connection weight can be interpreted as "if node, then target value". Accordingly, a negative weight refers to a rule of the form "if node, then not target value".

The most informative rules can be found by ranking the information-theoretic connection weights (w_zij) in decreasing order. Both the rules having the highest positive weights and those having the lowest negative weights are of potential interest to a user. The sum of the connection weights at all unsplitted and final-layer nodes is equal to the estimated mutual information between a set of input attributes and a target attribute (see the definition of mutual information in Cover, 1991). According to the well-known Pareto principle, a small number of informative rules is expected to explain a major part of the total mutual information, which agrees with the experimental results presented in the next section.
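As a small illustration of the weight formula above, the sketch below computes w_zij from the tuple counts associated with one terminal node, using relative frequencies as the probability estimates (the same convention as in the sketch of sub-section 2.2). The function and parameter names are our own.

```python
import math
from collections import Counter

def connection_weights(node_rows, target, n_total, target_counts):
    """Information-theoretic weight w_zij of each connection between a node z and a target value Vij.

    node_rows     - tuples associated with the terminal node z
    n_total       - total number of tuples n in the relation
    target_counts - value counts of the target attribute over all n tuples (for the a priori P(Vij))
    """
    node_counts = Counter(t[target] for t in node_rows)
    weights = {}
    for value, count in node_counts.items():
        p_joint = count / n_total                    # P(Vij; z) - joint probability
        p_cond = count / len(node_rows)              # P(Vij / z) - a posteriori probability
        p_prior = target_counts[value] / n_total     # P(Vij)     - a priori probability
        weights[value] = p_joint * math.log2(p_cond / p_prior)   # positive, negative, or near zero
    return weights
```

Ranking such weights over all terminal nodes in decreasing order yields a rule list of the kind reported in the case study below (Table 2).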
2.4 Computational Complexity

To calculate the computational complexity of the feature selection procedure, we use the following notation:

n - total number of tuples in a training data set
|C| - total number of candidate input attributes
p - portion of significant input attributes selected by the search procedure
m - number of hidden layers (input attributes), m ≤ |C|
|O| - total number of target attributes
MC - maximum domain size of a candidate input attribute
MT - maximum domain size of a target attribute

The computational "bottleneck" of the algorithm is calculating the estimated conditional mutual information MI(Ai'; Ai / z) of the candidate input attribute Ai' and the target attribute Ai, given a hidden node z. Since each node of the m-th hidden layer represents a conjunction of values of m input attributes, the total number of nodes at a layer No. m is apparently bounded by (MC)^m. However, we restrict defining a new node (see Step 4.2.3 above) by the requirement that there is at least one tuple associated with it. Thus, the total number of nodes at any hidden layer cannot exceed the total number of tuples (n). In most cases the number of nodes will be much smaller than n, due to tuples having identical values and the statistical significance requirement of the likelihood-ratio test.

The calculation of MI(Ai'; Ai / z) is performed at each hidden layer of every target attribute for all candidate input attributes at that layer. The summation members of MI(Ai'; Ai / z) refer to a Cartesian product of values of a candidate input attribute and a target attribute. The number of hidden layers is equal to p|C|. This implies that the total number of calculations is bounded by:

$|O| \cdot \sum_{m=0}^{p|C|} n \cdot M_T \cdot M_C \cdot (|C| - m) \le |O| \cdot n \cdot M_T \cdot M_C \cdot \frac{|C|^2 \cdot p \cdot (2 - p)}{2}$

Thus, the feature selection algorithm can be applied to large-scale databases in a time which is linear in the number of records and quadratic in the number of candidate input attributes. It is also directly proportional to the number of target attributes (key performance indicators).
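As a purely illustrative order-of-magnitude check of this bound, the snippet below evaluates the summation term for one hypothetical parameter setting (the numbers are assumed for illustration and are not taken from the case study):

```python
def operation_count(n, c, p, m_t, m_c, num_targets=1):
    """|O| * sum_{m=0}^{p|C|} n * M_T * M_C * (|C| - m) - the summation side of the bound above."""
    layers = int(p * c)                  # number of hidden layers actually built
    return num_targets * sum(n * m_t * m_c * (c - m) for m in range(layers + 1))

# Hypothetical example: 100,000 tuples, 200 candidate attributes, 5% of them selected,
# 20-valued inputs and a 10-valued target -> roughly 4 * 10^10 elementary calculations.
print(operation_count(n=100_000, c=200, p=0.05, m_t=10, m_c=20))
```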
3 Case Study: Direct Marketing Database

3.1 Background and Objectives

The original source of data is the Paralyzed Veterans of America (PVA), a non-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an in-house database of over 13 million donors, PVA is also one of the largest direct mail fundraisers in the US. The data set presents the results of one of PVA's recent fund-raising appeals. This mailing was sent to 3.5 million PVA donors who were on the PVA database as of June 1997. Everyone included in this mailing had donated at least once to PVA.

One group that is of particular interest to PVA is "lapsed" donors. These individuals donated to PVA 13 to 24 months ago. They represent an important group to PVA, since the longer someone goes without donating, the less likely they will be to give again. Therefore, recapture of these former donors is a critical aspect of PVA's fund-raising efforts. The data set to be analyzed includes all lapsed donors who received the mailing (responders and non-responders). The total dollar amount of the gift is given for each responder. The attributes extracted from the database can be used for understanding the behavior of both the most and the least profitable individuals. This important insight may lead to more successful promotion campaigns in the future.

The data set of lapsed donors, extracted from the PVA database, is publicly available on the UCI KDD Archive [http://kdd.ics.uci.edu] as the KDD Cup 1998 Data. It was originally used for the Second International Knowledge Discovery and Data Mining Tools Competition in 1998.

3.2 Database Characteristics

The special characteristics of the Direct Marketing database include the following:

• Dimensionality. The data set to be used contains 95,412 tuples, including 5.1% (4,843 cases) of responders to the mailing. Each tuple has 481 attributes, namely one key, 478 input, and 2 target attributes.
• Input Attributes. There are 76 character (nominal) and 402 numeric (continuous) attributes. These attributes include donor demographic data (as collected by PVA and third-party data sources), the promotion / donation history, and the characteristics of the donor's neighborhood, as collected from the 1990 US Census. For privacy reasons, no identifying data (like the donor name, address, etc.) has been included in the data set.
• Target Attributes. There are two target attributes in each tuple: the binary indicator for response (the attribute TARGET_B) and the dollar amount of the donation (the attribute TARGET_D). Naturally, the donation amount is zero for all non-responders. Both can be considered as key performance indicators (KPI's) for the fund-raising organization.
• Data Quality. As indicated in the database documentation, some of the fields in the analysis file may contain data entry and/or formatting errors.
• Missing Values. Most input attributes contain a certain amount of missing values, which should be inferred from known values at the pre-processing stage.

3.3 Data Pre-processing

The pre-processing tasks included the following:

• Attribute decoding. The original attributes presenting donor promotion history and status contain codes, where each byte has a different meaning (e.g., recency, frequency, and amount of donation). These codes have been decoded by splitting each encoded attribute into several separate attributes. The decoding operation has increased the total number of attributes from 481 to 518.
• Missing values. Missing values have been replaced with the mode (the most frequent value of an attribute). As recommended by the data documentation, attributes containing 99.5% and more missing values have been omitted from the analysis.
• Rare values. All values of nominal attributes that occur less than 100 times in the data set have been encoded as "Other".
• Transformation by division. To decrease the number of distinct values of large-scale continuous attributes, the values of some attributes have been divided by a constant factor (10, 100, etc.) and then rounded off to the nearest integer number.
• Discretization of the target attribute. The continuous target attribute TARGET_D has been discretized to equal-width intervals of $10 donation each. The total number of discretization intervals is 20, covering the attribute range between $0 and $200.
• Discretization of input attributes. Numeric attributes have been discretized by using the significance level of 99%. For 191 attributes (out of 404), no statistically significant partition has been found, and these attributes have been omitted from the further stages of the analysis. The remaining 213 attributes have been left as candidate input attributes.
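Pre-processing steps of this kind can be scripted in a few lines. The following pandas sketch is only an illustration under our own assumptions (TARGET_D is the data set's column name; the byte-level decoding of the promotion-history codes and the per-attribute division factors are omitted), not the authors' original pipeline.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Drop attributes with 99.5% or more missing values (per the data documentation).
    df = df.loc[:, df.isna().mean() < 0.995]

    # Missing values: replace with the mode (most frequent value) of each attribute.
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Rare values: nominal values occurring fewer than 100 times become "Other".
    for col in df.select_dtypes(include="object").columns:
        counts = df[col].value_counts()
        df[col] = df[col].where(~df[col].isin(counts[counts < 100].index), "Other")

    # Discretization of the target: 20 equal-width $10 intervals covering $0 - $200.
    df["TARGET_D_BIN"] = pd.cut(df["TARGET_D"], bins=range(0, 201, 10), right=False)

    return df
```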
3.4 Feature Selection

The process of feature selection in the Direct Marketing database requires building separate networks for the two KPI's: TARGET_B (the binary indicator for response) and TARGET_D (the dollar amount of the donation). However, when we applied the dimensionality reduction procedure of sub-section 2.2 above to the first target attribute (TARGET_B), no significant input attributes were found. In other words, no candidate input attribute present in the database is relevant to the fact of somebody responding to the mailing. Thus, we proceed with the results of the information-theoretic network built for the second target attribute (TARGET_D) only.
The dimensionality reduction procedure of sub-section 2.2 above has been applied to all tuples of responders to the fund-raising appeal (4,843). The algorithm has selected five significant input attributes, which are about 1% of the original candidate input attributes (478). Thus, the resulting information-theoretic network includes five hidden layers only. The selected attributes and their information-theoretic scores are presented in Table 1 below.

Table 1 shows three information-theoretic measures of association between the input attributes and the target attribute: Mutual Information, Conditional Mutual Information, and Conditional Entropy. All these parameters are based on the notion of entropy (see Cover, 1991), which represents the uncertainty of a random variable. The entropy is measured in bits. Information on input attributes, associated with the target, can decrease the uncertainty and the resulting entropy of the target.

The column "Mutual Information" shows the cumulative association between the subset of input attributes, selected up to a given iteration inclusively, and the target attribute. The next column, "Conditional MI (Mutual Information)", shows the net decrease in the entropy of the target attribute TARGET_D due to adding each input attribute. The last column ("Conditional Entropy") is equal to the difference between the unconditional entropy of TARGET_D (1.916 bits) and the estimated mutual information.

As can be seen from the description column, the first four attributes selected represent the person's donation history, while the fifth significant input attribute characterizes the donor's neighborhood. The most significant attribute, LASTGIFT, contributes alone about 90% of the total mutual information. From the viewpoint of interestingness, the last attribute (TPE13) seems to be the most unexpected one. It is rather unreasonable that an average human expert in donations and direct mailing would pick this attribute out of a list of more than 500 attributes.

If the PVA organization is designing a data warehouse for supporting its fundraising activities, the results of the feature selection algorithm can cause a dramatic reduction in the amount of stored data. Given that an average attribute (field) requires four bytes of memory, one donor tuple (record) of 478 input attributes takes about 2 KB. Since PVA has a donor database of about 13 million records, a dimensionality reduction of 99% means saving about 25 GB of computer storage at the time of data creation (1997). Perhaps PVA could be interested in storing additional information in its data warehouse, based on its business expertise, but the above results still suggest that the majority of the operational data is not needed for loading into the data warehouse: that data is either redundant or completely irrelevant to PVA's fundraising performance.
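The storage figure quoted above can be checked with a few lines of arithmetic (a sketch; the four-bytes-per-field size is the paper's own rough assumption):

```python
bytes_per_record = 478 * 4                        # 478 input attributes, ~4 bytes each -> ~2 KB
database_bytes = 13_000_000 * bytes_per_record    # ~13 million donor records
saved_bytes = 0.99 * database_bytes               # a 99% dimensionality reduction
print(round(saved_bytes / 1e9, 1), "GB")          # ~24.6 GB, i.e., about 25 GB
```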
Table 1: Direct marketing database – dimensionality reduction procedure

Iteration | Attribute Number | Attribute Name | Attribute Description | Mutual Information | Conditional MI | Conditional Entropy
1 | 318 | LASTGIFT | Dollar amount of most recent gift | 0.6976 | 0.6976 | 1.2188
2 | 316 | MAXRAMNT | Dollar amount of largest gift to date | 0.7599 | 0.0624 | 1.1564
3 | 314 | MINRAMNT | Dollar amount of smallest gift to date | 0.7671 | 0.0072 | 1.1492
4 | 319 | LASTDATE | Date associated with the most recent gift | 0.7711 | 0.0040 | 1.1452
5 | 233 | TPE13 | Percent Traveling 15 - 59 Minutes to Work | 0.7737 | 0.0025 | 1.1427
3.5 Functional Dependencies

The connections between the input and target nodes in the information-theoretic network, constructed in the previous sub-section, have been used to extract disjunctive association rules between the significant input attributes and the target attribute (TARGET_D). A total of 935 positive and 85 negative rules have been extracted. The rules have been scored by the information-theoretic weights of their connections (see sub-section 2.3 above).

In Table 2 below, we present only the rules having the highest connection weights in the network. Most rules in this table indicate that there is a direct relationship between the donation history and the actual amount of the donation (TARGET_D). Thus, rule no. 2 says that if the last gift is between $20.5 and $28, then the actual donation is expected to be between $20 and $29 (almost the same range). A similar interpretation can be given to rule no. 3: those who were poor donors the last time are expected to preserve their behavior. Rule no. 1 is a slight but interesting exception: it says that those who donated about $20 the last time are expected to increase their donation by as much as 50%.
Table 2: Direct marketing database – highest positive connection weights
(An empty cell means that the corresponding attribute does not appear in the rule's conjunction.)

Rule | LASTGIFT | MAXRAMNT | MINRAMNT | TARGET_D | Weight | Cum. Weight | Cum. Percent
1 | $19 - $20.25 | | | $20 - $29 | 0.1164 | 0.1164 | 11.9%
2 | $20.5 - $28 | | | $20 - $29 | 0.0859 | 0.2023 | 20.7%
3 | $1 - $5 | | | $0 - $9 | 0.0841 | 0.2864 | 29.3%
4 | $12 - $14 | | | $10 - $19 | 0.0596 | 0.3460 | 35.4%
5 | $1 - $5 | $7 - $9 | | $0 - $9 | 0.0534 | 0.3994 | 40.9%
6 | $1 - $5 | $10 - $11 | | $0 - $9 | 0.0402 | 0.4396 | 45.0%
7 | $6 - $7 | $7 - $9 | | $0 - $9 | 0.0392 | 0.4788 | 49.0%
8 | $10 - $11 | $10 - $11 | | $10 - $19 | 0.0377 | 0.5165 | 52.9%
9 | $20.5 - $28 | | | $30 - $39 | 0.0307 | 0.5472 | 56.0%
10 | $15 - $15.5 | $15 - $15.5 | $4.68 - $8.99 | $10 - $19 | 0.0293 | 0.5765 | 59.0%
4 Conclusions

The paper presents a method for the automated selection of attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases (Maimon et al., 1999). It does not require any prior knowledge about the business domain, can eliminate both irrelevant and redundant features, and is applicable to any type of data (nominal, numeric, etc.). While no automated procedure can fully replace a human expert, the decisions of data warehouse designers can certainly be supported by the feature selection system presented here. As shown by the direct marketing example, integrating automated feature selection in the design process can lead to a significant dimensionality reduction of the data warehouse.

Developing automated approaches to the design of data warehouses requires further investigation of several issues. For example, automated methods need to be developed for determining the set of transformations to be applied to the original attributes. A cost associated with each attribute (e.g., the cost of third-party data) can be integrated in the feature selection procedure. The data model (see sub-section 2.1 above) can be further extended to include candidate input attributes from multiple relations (via foreign keys). Another important problem is detecting dynamic changes in the weak functional dependencies and updating the data warehouse structure accordingly.

References

H. Almuallim and T. G. Dietterich, Efficient Algorithms for Identifying Relevant Features, Proc. of the 9th Canadian Conf. on AI, pages 38-45, Morgan Kaufmann, 1992.

S. D. Bay, The UCI KDD Archive [http://kdd.ics.uci.edu], Irvine, CA: University of California, Department of Information and Computer Science, 1999.

R. Caruana and D. Freitag, Greedy Attribute Selection, Proc. of the 11th Conf. on Machine Learning, pages 28-36, Morgan Kaufmann, 1994.

T. M. Cover, Elements of Information Theory, Wiley, New York, 1991.

J. F. Elder IV and D. Pregibon, A Statistical Perspective on Knowledge Discovery in Databases, in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, Menlo Park, CA, pp. 83-113, 1996.

V. R. Gupta, An Introduction to Data Warehousing, System Services Corporation, Chicago, Illinois, 1997.

W. H. Inmon and R. D. Hackathorn, Using the Data Warehouse, John Wiley & Sons, New York, 1994.

G. H. John, R. Kohavi, and K. Pfleger, Irrelevant Features and the Subset Selection Problem, Proc. of the 11th Int'l Conf. on Machine Learning, pages 121-129, Morgan Kaufmann, 1994.

K. Kira and L. A. Rendell, The Feature Selection Problem: Traditional Methods and a New Algorithm, Proc. of AAAI'92, pages 129-134, 1992.

H. F. Korth and A. Silberschatz, Database System Concepts, McGraw-Hill, Inc., New York, 1991.

H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer, Boston, 1998.

O. Maimon, A. Kandel, and M. Last, Information-Theoretic Fuzzy Approach to Knowledge Discovery in Databases, in Advances in Soft Computing - Engineering Design and Manufacturing, R. Roy, T. Furuhashi, and P. K. Chawdhry, Eds., Springer-Verlag, London, pp. 315-326, 1999.

H. Schouten, Analysis and Design of Data Warehouses, Proc. International Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, 1999.