<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Dimensionality Reduction of Data Warehouses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Last</string-name>
          <email>mlast@csee.usf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oded Maimon</string-name>
          <email>maimon@eng.tau.ac.il</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, University of South Florida</institution>
          ,
          <addr-line>4202 E. Fowler Avenue, ENB 118, Tampa, FL 33620</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Industrial Engineering, Tel-Aviv University</institution>
          ,
          <addr-line>Tel Aviv 69978</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <fpage>7</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>A data warehouse is designed to consolidate and maintain all attributes that are relevant for the analysis processes. Due to the rapid increase in the size of the modern operational systems, it becomes neither practical, nor necessary to load and maintain in the data warehouse every operational attribute. This paper presents a novel methodology for automated selection of the most relevant independent attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases. Attributes are selected by a stepwise forward procedure aimed at minimizing the uncertainty in the values of key performance indicators (KPI's). Each selected attribute is assigned a score, expressing its degree of relevance. Using the method does not require any prior expertise in the domain of the data and it can be equally applied to nominal and ordinal attributes. An attribute will be included in a data warehouse schema, if it is found as relevant to at least one KPI. We demonstrate the applicability of the method by reducing the dimensionality of a direct marketing database.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A data warehouse is defined by
        <xref ref-type="bibr" rid="ref3">Inmon (1994)</xref>
        as a
“subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of
management’s decision making.” Each
organization has its own key performance indicators
(KPI), which are used by management to monitor
the organization’s performance. Thus, a
manufacturing company may measure its
performance by throughput and cost, a KPI of a
service company is the mean time to handle a
service call, etc. When designing a data warehouse,
one must decide which operational attributes are
relevant enough to be loaded and maintained.
      </p>
      <p>
        John et al. (1994) distinguish between two models
for selecting a “good” set of features under some
objective function. The feature filter model
assumes filtering the features before applying a data
mining algorithm, while the wrapper model uses the
data mining algorithm itself to evaluate the features.
The possible search strategies in the space of feature
subsets include backward elimination and forward
selection. The performance criterion of the wrapper
model in
        <xref ref-type="bibr" rid="ref3">(John et al., 1994)</xref>
        is the prediction
accuracy of a data mining algorithm, estimated by
n-fold cross validation.
      </p>
      <p>A new book on feature selection by Liu and Motoda
(1998) suggests a unified model of the feature
selection process. Their model includes four parts:
feature generation, feature evaluation, stopping
criteria, and testing. In addition to the “classic”
evaluation measures (accuracy, information,
distance, and dependence) that can be used for
removing irrelevant features, they mention
important consistency measures (e.g., inconsistency
rate), required to find a minimum set of relevant
features. By decreasing the inconsistency rate of
data, both irrelevant and redundant features are
removed. However, as indicated by Liu and
Motoda (1998), consistency measures are only
suitable for selecting discrete features.</p>
      <p>
        An enhanced greedy algorithm, based on the
wrapper model, is presented by
        <xref ref-type="bibr" rid="ref3">Caruana and Freitag
(1994)</xref>
        . Again, the metric used is the generalization
performance of the learning algorithm (its accuracy
over the validation data set), which increases the
computation time of the entire process.
      </p>
      <p>
        An information-theoretic method for selecting
relevant features is presented by
        <xref ref-type="bibr" rid="ref1">Almuallim and Dietterich (1992)</xref>
        . In their
Mutual-Information-Greedy (MIG) Algorithm, defined for Boolean
noise-free features, the feature is selected if it leads
to the minimum conditional entropy of the
classification attribute. Since the data is assumed
to be noise-free, no significance testing is required
(any non-zero entropy is significant). The above
assumptions leave the MIG algorithm at quite a
distance from most practical problems of reducing
data warehouse dimensionality.
      </p>
      <p>
        <xref ref-type="bibr" rid="ref9">Kira and Rendell (1992</xref>
        ) have suggested an efficient
feature selection algorithm, called Relief, which
evaluates each attribute by its ability to distinguish
among instances that are near each other. Their
selection criterion, the feature relevance, is
applicable to numeric and nominal attributes. The
greatest limitation of Relief is its inability to
identify redundant features within a set of relevant
features. Consequently, the set of features selected
by Relief may not be optimal.
      </p>
      <p>
        To sum up this section: the backward elimination
strategy is very inefficient for reducing the
dimensionality of real-world data warehouses,
which may have hundreds or thousands of original
attributes. On the other hand, the forward selection
wrapper methods are highly expensive in terms of
computational effort. The filter algorithms are
computationally cheaper, but they usually fail to
remove all redundant features. In the next section,
we describe the information-theoretic method
of feature selection in data warehouses, based on the
knowledge discovery procedure introduced by us in
        <xref ref-type="bibr" rid="ref2">(Maimon, Kandel, and Last, 1999)</xref>
        . The procedure
integrates feature selection with a highly scalable
data mining algorithm, leading to the elimination of
both irrelevant and redundant features. The method
description is followed by a case study of feature
selection in a direct marketing database. We
conclude the paper by discussing potential
enhancements of the proposed approach.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Information-Theoretic Method of Feature Selection</title>
      <p>The method selects features by constructing
information-theoretic connectionist networks, which
represent interactions between the classifying
features and each dependent attribute (a key
performance indicator). The method is based on the
extended relational data model, described in
subsection 2.1. In sub-section 2.2, we present the main
steps of the feature selection procedure. Extraction
of functional dependencies from the constructed
networks is covered by sub-section 2.3. Finally, in
sub-section 2.4, we evaluate the computational
complexity of the algorithm.</p>
      <sec id="sec-3-1">
        <title>Extended Relational Data Model</title>
        <p>
          We use the following standard notation of the
relational data model
          <xref ref-type="bibr" rid="ref10">(see Korth and Silberschatz,
1991)</xref>
          :
        </p>
        <p>1) R - a relation schema including N attributes (N
≥ 2). A relation is a part of the operational
database.</p>
        <p>2) Ai - an attribute No. i. R = (A1, ..., AN).</p>
        <p>3) Di - the domain of an attribute Ai. We assume
that each domain is a finite set of Mi discrete values
(∀i: Mi ≥ 2). For numeric attributes, the
domain is a set of adjacent intervals. The
discretization of classifying continuous
attributes is performed automatically in the
process of feature selection (see next
sub-section). Continuous KPI's may be discretized
manually into pre-defined intervals.</p>
        <p>4) Vij - a value No. j of domain Di. Consequently,
Di = (Vi1, ..., ViMi). For discretized numeric
attributes, each value represents an interval
between two continuous values.</p>
        <p>5) r - a relation instance (table) of the relation
schema R.</p>
        <p>6) n - the number of tuples (records) in a relation r
(n ≥ 2).</p>
        <p>7) tk[Ai] - the value of an attribute No. i in a tuple
(record) No. k. ∀k,i: tk[Ai] ∈ Di.</p>
        <p>To find the set of classifying attributes in a relation
of the operational database, we partition the
relation schema as follows:</p>
        <p>
          1) O - a subset of target (dependent) attributes (O
⊂ R, |O| ≥ 1). These attributes represent the key
performance indicators (KPI's) of an
organization. Such attributes are referred to as
facts in the data warehouse logical model
          <xref ref-type="bibr" rid="ref3">(Inmon, 1994)</xref>
          . Each connectionist network is
aimed at predicting the values of a KPI, based
on the values of input attributes (see below).
2) C - a subset of candidate input attributes (C ⊂
R, |C| ≥ 1). This is a subset of attributes
(features), which can be related to the target
attributes.
3) Ii - a subset of input attributes (classifying
features) selected by the algorithm as related to
the target attribute i (∀i: Ii ⊂ C). These
attributes are going to be loaded as dimensions
into the data warehouse.
        </p>
        <p>Assumptions:
1) ∀i : Ii ∩ O = ∅ (an attribute cannot be both
an input and a target);
2) ∀i : Ii ∪ O ⊆ R (some attributes may be
neither input, nor target). For example, the
attributes used as primary keys in the
operational database (like a social security
number) may be completely useless for a data
warehouse, which has primary keys of its own.
Sometimes (e.g., in health care), the identifying
attributes are removed from the data warehouse
to preserve the privacy of the historic records.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Feature Selection Procedure</title>
        <p>The main steps of the feature selection procedure
are given below.</p>
        <p>Step1 - Obtain the relation schema (name, type, and
domain size of each attribute) and the schema
partition into a subset of candidate input and a
subset of target attribute (see the extended relational
model in sub-section 2.1 above).</p>
        <p>Step 2 - Read the relation tuples (records) from the
operational database. Tuples with illegal or missing
target values are ignored by the algorithm.
Step 2.1- Encode missing values of candidate input
attributes in a pre-determined form.</p>
        <p>
          Step 2.2- Discretize each continuous attribute by
maximizing its mutual information with the target
attribute. Mutual information
          <xref ref-type="bibr" rid="ref4">(see Cover, 1991)</xref>
          is
defined as a decrease in the uncertainty of one
attribute, given the value of another attribute.
Step 3 - Enter minimum significance level for
splitting a network node (default = 0.1%). This
significance level is used by the likelihood ratio test
(see Step 4.2.1.2.3 below).
        </p>
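        <p>As an illustration of the information-theoretic machinery behind Step 2.2 (this is not code from the paper; the sample values are made up), the mutual information between two discrete attributes can be estimated from empirical probabilities as follows.</p>
        <preformat>
```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate the mutual information (in bits) between two discrete
    attributes from paired samples, using empirical probabilities."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

xs = [0, 0, 1, 1]
print(mutual_information(xs, xs))            # 1.0 (fully dependent pair)
print(mutual_information(xs, [0, 1, 0, 1]))  # 0.0 (independent pair)
```
        </preformat>
        <p>A candidate discretization of a continuous attribute could then be scored by the mutual information between the resulting interval labels and the KPI.</p>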
        <p>
          Step 4 - Repeat for every target attribute (KPI) i:
Step 4.1 - Initialize the information-theoretic
network (one hidden layer including the root node
associated with all tuples, no input attributes, one
target layer for values of the target attribute). A
new network is built for every target attribute. An
example of the initial network structure for a
three-valued target attribute is shown in Figure 1.
Step 4.2.1 - Repeat for every candidate input attribute:
Step 4.2.1.1 - Initialize to zero the degrees of
freedom, the estimated conditional mutual
information, and the likelihood-ratio statistic of the
candidate input attribute and the target attribute,
given the final hidden layer of nodes. Conditional
mutual information
          <xref ref-type="bibr" rid="ref4">(Cover, 1991)</xref>
          measures the net
decrease in the entropy (uncertainty) of the
dependent attribute due to adding information about
each new classifying attribute. If the target is
completely independent of an attribute, the
conditional mutual information, given that attribute
is zero.
        </p>
        <p>Step 4.2.1.2 - Repeat for every node of the final
hidden layer:
Step 4.2.1.2.1 - Calculate the estimated conditional
mutual information of the candidate input attribute
and the target attribute, given the node.</p>
        <p>Step 4.2.1.2.2 - Calculate the likelihood-ratio
statistic of the candidate input attribute and the
target attribute, given the node.</p>
        <p>Step 4.2.1.2.3 - If the likelihood-ratio statistic is
significant, mark the node as “splitted” and
increment the conditional mutual information of the
candidate input attribute and the target attribute,
given the final hidden layer of nodes; else mark the
node as “unsplitted”.</p>
        <p>Step 4.2.1.2.4 - Go to next node.</p>
        <p>Step 4.2.1.3 - Go to next candidate input attribute.
Step 4.2.2 - Find the candidate input attribute
maximizing the estimated conditional mutual
information. According to the information theory,
this attribute is the best predictor of the target, given
the values of the other input attributes.</p>
        <p>Step 4.2.3 - If the maximum estimated conditional
mutual information is greater than zero:</p>
        <p>Make the best candidate attribute an input
attribute.</p>
        <p>Define a new layer of hidden nodes for a
Cartesian product of splitted hidden nodes of
the previous layer and values of the best
candidate attribute.</p>
        <p>Record the tuples associated with every node of
the new layer (a new node is defined if there is
at least one tuple associated with it).</p>
        <p>Else stop the search and output the subset Ii of input
attributes (classifying features) associated with the
target attribute i.</p>
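        <p>The stepwise search of Steps 4.1-4.3 can be sketched as a greedy loop that minimizes the conditional entropy of the target. This is a simplified illustration, not the authors' implementation: the likelihood-ratio significance test is replaced by a minimum-gain threshold, and a hidden node is represented by the tuple of values of the selected attributes.</p>
        <preformat>
```python
import math
from collections import Counter, defaultdict

def cond_entropy(partition_keys, targets):
    """H(target | partition): entropy of the target inside each group
    defined by partition_keys, weighted by the group's relative size."""
    n = len(targets)
    groups = defaultdict(list)
    for key, t in zip(partition_keys, targets):
        groups[key].append(t)
    h = 0.0
    for vals in groups.values():
        m = len(vals)
        h += (m / n) * -sum(c / m * math.log2(c / m)
                            for c in Counter(vals).values())
    return h

def forward_select(rows, candidates, target, min_gain=1e-6):
    """Greedily add the attribute giving the largest drop in the
    conditional entropy of the target; stop when no attribute helps."""
    selected = []
    targets = [r[target] for r in rows]
    best_h = cond_entropy([()] * len(rows), targets)
    while True:
        best_attr, best_attr_h = None, best_h
        for a in candidates:
            if a in selected:
                continue
            keys = [tuple(r[x] for x in selected + [a]) for r in rows]
            h = cond_entropy(keys, targets)
            if best_attr_h - min_gain > h:
                best_attr, best_attr_h = a, h
        if best_attr is None:
            return selected
        selected.append(best_attr)
        best_h = best_attr_h

rows = [
    {"a": 0, "b": 0, "kpi": 0},
    {"a": 0, "b": 1, "kpi": 0},
    {"a": 1, "b": 0, "kpi": 1},
    {"a": 1, "b": 1, "kpi": 1},
]
print(forward_select(rows, ["a", "b"], "kpi"))  # ['a']
```
        </preformat>
        <p>In this toy relation the attribute "a" determines the KPI, so "b" is redundant and is never selected, mirroring the elimination of redundant features described above.</p>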
        <sec id="sec-3-2-1">
          <title>Extraction of Functional Dependencies</title>
          <p>Step 4.3 - Go to next target attribute.</p>
          <p>Step 5 - Define the set I of selected attributes (dimensions) as the union of the sets of input attributes with respect to every target attribute: I = ∪ i=1..|O| Ii.</p>
          <p>Step 6 - End.</p>
          <p>In Figure 2, a structure of a two-layered network
(based on two selected input attributes) is shown.
The first input attribute has three values, represented
by nodes no. 1,2, and 3 in the first layer, but only
nodes no. 1 and 3 are splitted due to the statistical
significance testing in Step 4.2.1.2 above. The
second layer has four nodes standing for the
combinations of two values of the second input
attribute with two splitted nodes of the first layer.
Like in Figure 2, the target attribute has three
values, represented by three nodes in the target
layer.</p>
          <p>
            Details of the network construction procedure for a
single target attribute are provided in
            <xref ref-type="bibr" rid="ref12">(Maimon et
al., 1999)</xref>
            . The application of the algorithm to
design of data warehouses is presented here for the
first time.
          </p>
          <p>
Each connection between a terminal (unsplitted /
final layer) node and a target node in the
information-theoretic network represents an
association rule between a conjunction of input
attribute-values and a target value. Due to the
inherent noisiness of the real-world data, this rule
can be considered a weak (probabilistic) functional
dependency as opposed to deterministic functional
dependencies of primary and alternate keys
            <xref ref-type="bibr" rid="ref2">(Schouten, 1999)</xref>
            . In the relational model
            <xref ref-type="bibr" rid="ref10">(see
Korth and Silberschatz, 1991)</xref>
            , a functional
dependency between sets of attributes X and Y
means that the values of the Y component of a tuple
are determined by the values of the X component.
In contrast, an association rule between X and Y
has a more limited meaning: given values of X, we
can estimate the probability distribution of Y.
An information-theoretic weight of each rule is
given by:
wzij = P(Vij ; z) • log [P(Vij / z) / P(Vij)]
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Interpretation of Connection Weights</title>
          <p>In this formula, P(Vij ; z) is the estimated joint probability of the target value Vij and the node z; P(Vij / z) is the estimated conditional (a posteriori) probability of the target value Vij, given the node z; and P(Vij) is the estimated unconditional (a priori) probability of the target value Vij.</p>
          <p>
The connection weights express the mutual
information between hidden and target nodes. A
connection weight is positive if the conditional
probability of a target attribute value, given the
node, is higher than its unconditional probability
and negative otherwise. A weight close to zero
means that the target attribute value is almost
independent of the node value. This means that
each rule having a positive connection weight can
be interpreted as if node, then target value.
Accordingly, a negative weight refers to a rule of
the form if node, then not target value.</p>
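          <p>As a small numeric illustration of the weight formula (the counts below are invented, not from the case study), each probability can be estimated from tuple counts:</p>
          <preformat>
```python
import math

def rule_weight(n_node_and_value, n_node, n_value, n_total):
    """w_zij = P(Vij; z) * log2(P(Vij | z) / P(Vij)), from counts."""
    p_joint = n_node_and_value / n_total   # P(Vij; z)
    p_cond = n_node_and_value / n_node     # P(Vij | z)
    p_prior = n_value / n_total            # P(Vij)
    return p_joint * math.log2(p_cond / p_prior)

# A node covers 100 of 1000 tuples; the target value holds in 80 of
# them but in only 200 tuples overall: a strong positive rule.
print(rule_weight(80, 100, 200, 1000))   # positive (about 0.16)
# The target value holds in only 10 of the node's tuples: a negative rule.
print(rule_weight(10, 100, 200, 1000))   # negative
```
          </preformat>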
          <p>
            The most informative rules can be found by ranking
the information-theoretic connection weights (w zij)
in decreasing order. Both the rules having the
highest positive and the lowest negative weights are
of potential interest to a user. The sum of
connection weights at all unsplitted and final layer
nodes is equal to the estimated mutual information
between a set of input attributes and a target
attribute
            <xref ref-type="bibr" rid="ref4">(see the definition of mutual information in
Cover, 1991)</xref>
            . According to the well-known Pareto
principle, a small number of informative rules are
expected to explain a major part of the total mutual
information, which agrees with the experimental
results presented in the next section.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Computational Complexity</title>
        <p>To calculate the computational complexity of the
feature selection procedure, we use the following
notation:</p>
        <p>n - the total number of tuples in the training data set</p>
        <p>|C| - the total number of candidate input attributes</p>
        <p>p - the portion of significant input attributes selected by the search procedure</p>
        <p>m - the number of hidden layers (input attributes), m ≤ |C|</p>
        <p>|O| - the total number of target attributes</p>
        <p>MC - the maximum domain size of a candidate input attribute</p>
        <p>MT - the maximum domain size of a target attribute</p>
        <p>The computational “bottleneck” of the algorithm is
The computational “bottleneck” of the algorithm is
calculating the estimated conditional mutual
information MI (Ai’ ; Ai / z) of the candidate input
attribute Ai’ and the target attribute Ai, given a
hidden node z. Since each node of the m-th hidden
layer represents a conjunction of values of m input
attributes, the total number of nodes at layer No. m
is bounded by (MC)^m. However, we
restrict defining a new node (see step 4.2.3 above)
by the requirement that there is at least one tuple
associated with it. Thus, the total number of nodes
at any hidden layer cannot exceed the total number
of tuples (n). In most cases the number of nodes
will be much smaller than n, due to tuples having
identical values and the statistical significance
requirement of the likelihood-ratio test.</p>
        <p>The calculation of MI (Ai’ ; Ai / z) is performed at
each hidden layer of every target attribute for all
candidate input attributes at that layer. The
summation members of MI (Ai’ ; Ai / z) refer to a
Cartesian product of values of a candidate input
attribute and a target attribute. The number of
hidden layers is equal to p|C|. This implies that the
total number of calculations is bounded by:</p>
        <p>|O| • ∑ (m = 0 to p|C|) n • MT • MC • (|C| − m) ≤ |O| • n • MT • MC • |C|² • p • (2 − p) / 2</p>
        <p>Thus, the feature selection algorithm can be applied
to large-scale databases in a time, which is linear in
the number of records and quadratic polynomial in
the number of candidate input attributes. It is also
directly proportional to the number of target
attributes (key performance indicators).</p>
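        <p>Plugging illustrative numbers into this bound (loosely based on the case study in the next section; the maximum candidate domain size MC is an assumption, not a figure from the paper) gives a feel for its magnitude:</p>
        <preformat>
```python
# Bound: |O| * n * MT * MC * |C|**2 * p * (2 - p) / 2
n, C, O = 4843, 213, 1   # tuples, candidate inputs, target attributes
MT, MC = 20, 10          # max domain sizes (MC is assumed)
p = 5 / C                # five of 213 candidates were selected

bound = O * n * MT * MC * C**2 * p * (2 - p) / 2
print(f"at most {bound:.3e} MI calculations")
```
        </preformat>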
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Case Study: Direct Marketing Database</title>
      <sec id="sec-5-1">
        <title>Background and Objectives</title>
        <p>
          The original source of data is the Paralyzed
Veterans of America (PVA), a non-profit
organization that provides programs and services for
US veterans with spinal cord injuries or disease.
With an in-house database of over 13 million
donors, PVA is also one of the largest direct mail
fundraisers in the US. The data set presents the
results of one of PVA's recent fund raising appeals.
This mailing was sent to 3.5 million PVA donors
who were on the
          <xref ref-type="bibr" rid="ref6">PVA database as of June 1997</xref>
          .
Everyone included in this mailing had donated at
least once to PVA.
        </p>
        <p>One group that is of particular interest to PVA is
"Lapsed" donors. These individuals donated to
PVA 13 to 24 months ago. They represent an
important group to PVA, since the longer someone
goes without donating, the less likely they will be to
give again. Therefore, recapture of these former
donors is a critical aspect of PVA's fund raising
efforts. The data set to be analyzed includes all
lapsed donors, who received the mailing
(responders and non-responders). The total dollar
amount of gift is given for each responder. The
attributes extracted from the database can be used
for understanding the behavior of both the most and
the least profitable individuals. This important
insight may lead to more successful promotion
campaigns in the future.</p>
        <p>The dataset of lapsed donors, extracted from the
PVA database, is publicly available on the UCI
KDD Archive [http://kdd.ics.uci.edu] as KDD Cup
1998 Data. It has been originally used for the
Second International Knowledge Discovery and
Data Mining Tools Competition in 1998.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Database Characteristics</title>
        <p>The special characteristics of the Direct Marketing
database include the following:</p>
        <p>Dimensionality. The data set to be used
contains 95,412 tuples, including 5.1% (4,843
cases) of responders to the mailing. Each tuple
has 481 attributes, namely one key, 478 input
and 2 target attributes.</p>
        <p>Input Attributes. There are 76 character
(nominal) and 402 numeric (continuous)
attributes. These attributes include donor
demographic data (as collected by PVA and
third-party data sources), the promotion /
donation history, and the characteristics of the
donors' neighborhoods, as collected from the
1990 US Census. For privacy reasons, no
identifying data (like the donor name, address,
etc.) has been included in the data set.</p>
        <p>Target Attributes. There are two target
attributes in each tuple: the binary indicator for
response (the attribute TARGET_B) and the
dollar amount of the donation (the attribute
TARGET_D). Naturally, the donation amount
is zero for all non-responders. Both can be
considered as key performance indicators
(KPI’s) for the fund raising organization.
Data Quality. As indicated in the database
documentation, some of the fields in the
analysis file may contain data entry and/or
formatting errors.</p>
        <p>Missing Values. Most input attributes contain a
certain amount of missing values, which should
be inferred from known values at the
preprocessing stage.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Data Pre-processing</title>
        <p>The pre-processing tasks included the following:</p>
        <p>Attribute decoding. The original attributes,
presenting donor promotion history and status,
contain codes, where each byte has a different
meaning (e.g., recency, frequency and amount
of donation). These codes have been decoded
by splitting each encoded attribute into several
separate attributes. The decoding operation has
increased the total number of attributes from
481 to 518.</p>
        <p>Missing values. Missing values have been
replaced with the mode (the most frequent
value of an attribute). As recommended by the
data documentation, attributes with 99.5% or
more missing values have been omitted
from the analysis.
        <p>Rare values. All values of nominal attributes
that occur less than 100 times in the data set
have been encoded as “Other”.</p>
        <p>Transformation by division. To decrease the
number of distinct values for large scale
continuous attributes, the values of some
attributes have been divided by a constant
factor (10, 100, etc.) and then rounded off to
the nearest integer number.</p>
        <p>Discretization of Target Attribute. The
continuous target attribute TARGET_D has
been discretized to equal width intervals of $10
donation each. The total number of
discretization intervals has been 20, covering
the attribute range between $0 and $200.
Discretization of Input Attributes. Numeric
attributes have been discretized by using the
significance level of 99%. For 191 attributes
(out of 404), no statistically significant partition
has been found and these attributes have been
omitted from the further stages of the analysis.
The remaining 213 attributes have been left as
candidate input attributes.</p>
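        <p>Several of the transformations above (mode imputation of missing values, rare-value grouping, and transformation by division) can be sketched as follows; the function names, the None missing-value sentinel, and the sample values are illustrative, not from the PVA data dictionary.</p>
        <preformat>
```python
from collections import Counter

def mode_impute(values, missing=None):
    """Replace missing entries with the attribute's most frequent value."""
    mode = Counter(v for v in values if v is not missing).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

def group_rare(values, min_count=100, other="Other"):
    """Recode nominal values occurring fewer than min_count times."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def scale_round(values, factor=10):
    """Divide by a constant factor and round to the nearest integer."""
    return [round(v / factor) for v in values]

print(mode_impute(["A", None, "A", "B"]))       # ['A', 'A', 'A', 'B']
print(group_rare(["x"] * 150 + ["y"] * 3)[-1])  # 'Other'
print(scale_round([95, 104]))                   # [10, 10]
```
        </preformat>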
      </sec>
      <sec id="sec-5-4">
        <title>Feature Selection</title>
        <p>The process of feature selection in the Direct
Marketing database requires building separate
networks for two KPI’s: TARGET_B (the binary
indicator for response) and TARGET_D (the dollar
amount of the donation). However, when we have
applied the dimensionality reduction procedure of
sub-section 2.2 above to the first target attribute
(TARGET_B), no significant input attributes have
been found. In other words, no candidate input
attribute present in the database is relevant to
whether somebody responds to the mailing.
Thus, we proceed with the results of the
information-theoretic network built for the second
target attribute (TARGET_D) only.
The dimensionality reduction procedure of
sub-section 2.2 above has been applied to all tuples of
responders to the fund raising appeal (4,843). The
algorithm has selected five significant input
attributes, which are about 1% of the original
candidate input attributes (478). Thus, the resulting
information-theoretic network includes five hidden
layers only. The selected attributes and their
information-theoretic scores are presented in Table
1 below.</p>
        <p>
          Table 1 shows three information-theoretic measures
of association between the input attributes and the
target attribute: Mutual Information, Conditional
Mutual Information, and Conditional Entropy. All
these parameters are based on the notion of Entropy
          <xref ref-type="bibr" rid="ref4">(see Cover, 1991)</xref>
          , which represents the uncertainty
of a random variable. The entropy is measured in
bits. Information on input attributes, associated
with the target, can decrease the uncertainty and the
resulting entropy of the target.
        </p>
        <p>The column “Mutual Information” shows the
cumulative association between a subset of input
attributes, selected up to a given iteration
inclusively, and the target attribute. The next
column, “Conditional MI (Mutual Information)”
shows the net decrease in the entropy of the target
attribute “Target_D” due to adding each input
attribute. The last column (“Conditional Entropy”)
is equal to the difference between the unconditional
entropy of Target_D (1.916 bits) and the estimated
mutual information.</p>
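        <p>The relationship between the columns can be verified arithmetically from the figures quoted in the text (all values in bits):</p>
        <preformat>
```python
h0 = 1.916         # unconditional entropy of Target_D (quoted above)
mi_first = 0.6976  # mutual information after the first selected attribute
# Conditional entropy = unconditional entropy - cumulative mutual information
print(round(h0 - mi_first, 4))  # 1.2184, matching the table up to rounding
```
        </preformat>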
        <p>As you can see from the description column, the
first four attributes selected represent the person’s
donation history, while the fifth significant input
attribute characterizes the donor neighborhood. The
most significant attribute LASTGIFT contributes
alone about 90% of total mutual information. From
the viewpoint of interestingness, the last attribute
(TPE13) seems to be the most unexpected one. It is
rather unlikely that an average human expert in
donations and direct mailing would pick this
attribute out of a list of more than 500 attributes.
If the PVA organization is designing a data
warehouse for supporting its fundraising activities,
the results of the feature selection algorithm can
cause a dramatic reduction in the amount of stored
data. Given that an average attribute (field) requires
four bytes of memory, one donor tuple (record) of
478 input attributes takes about 2k bytes. Since
PVA has a donor database of about 13 million
records, dimensionality reduction of 99% means
saving about 25 GB of computer storage at the time
of data creation (1997). Perhaps, PVA could be
interested in storing additional information in its
data warehouse, based on its business expertise, but
still the above results suggest that the majority of
operational data is not needed for loading into the
data warehouse: that data is either redundant, or
completely irrelevant to PVA’s fundraising
performance.
</p>
      <p>[Table 1 - the selected input attributes and their information-theoretic scores. Recovered columns: Mutual Information (cumulative: 0.6976, 0.7599, ...), Conditional MI (0.6976, 0.0624, 0.0072, 0.004, 0.0025), and Conditional Entropy (1.2188, 1.1564, 1.1492, 1.1452, 1.1427); the attribute name and description columns were not recoverable.]</p>
      <p>The connections between input and target nodes in
the information-theoretic network, constructed in
the previous sub-section, have been used to extract
disjunctive association rules between the significant
input attributes and the target attribute (Target_D).
A total of 935 positive and 85 negative rules have
been extracted. The rules have been scored by the
information-theoretic weights of their connections
(see sub-section 2.3 above).</p>
        <p>In Table 2 below, we present only the rules
having the highest connection weights in the
network. Most rules in this table indicate that there
is a direct relationship between the donation history
and the actual amount of donation (TARGET_D).
Thus, rule no. 2 says that if the last gift is between
$20.5 and $28, then the actual donation is expected
to be between $20 and $29 (almost in the same
range). A similar interpretation can be given to rule
no. 3: those who were poor donors for the last time
are expected to preserve their behavior. Rule no. 1
is a slight but interesting exception: it says that
those who donated about $20 for the last time are
expected to increase their donation by as much as
50%.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>
        The paper presents a method for automated
selection of attributes in a data warehouse. The
method is based on the information-theoretic
approach to knowledge discovery in databases
        <xref ref-type="bibr" rid="ref2">(Maimon et al., 1999)</xref>
        . It does not require any prior
knowledge about the business domain, can
eliminate both irrelevant and redundant features,
and is applicable to any type of data (nominal,
numeric, etc.). While no automated procedure can
fully replace a human expert, the decisions of data
warehouse designers can certainly be supported by the
feature selection system presented here. As shown
by the direct marketing example, integrating
automated feature selection into the design process
can lead to a significant dimensionality reduction of
the data warehouse.
      </p>
      <p>Developing automated approaches to the design of
data warehouses requires further investigation of
several issues. For example, automated methods
need to be developed for determining the set of
transformations to be applied to the original
attributes. A cost associated with each attribute
(e.g., cost of third-party data) can be integrated in
the feature selection procedure. The data model
(see sub-section 2.1 above) can be further extended
to include candidate input attributes from multiple
relations (via foreign keys). Another important
problem is detecting dynamic changes in the weak
functional dependencies and updating the data
warehouse structure accordingly.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Almuallim</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <article-title>Efficient Algorithms for Identifying Relevant Features</article-title>
          ,
          <source>Proc. of 9th Canadian Conf. on AI</source>
          , pages
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          , Morgan Kaufmann,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>S.D.</given-names>
            <surname>Bay</surname>
          </string-name>
          , The UCI KDD Archive [http://kdd.ics.uci.edu], Irvine, CA: University of California, Department of Information and Computer Science,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <article-title>Greedy Attribute Selection</article-title>
          ,
          <source>Proc. of 11th Conf. on Machine Learning</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>36</lpage>
          , Morgan Kaufmann,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Cover</surname>
          </string-name>
          ,
          <source>Elements of Information Theory</source>
          , Wiley, New York,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>J.F.</given-names>
            <surname>Elder IV</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pregibon</surname>
          </string-name>
          ,
          <article-title>A Statistical Perspective on Knowledge Discovery in Databases</article-title>
          , in
          <source>Advances in Knowledge Discovery and Data Mining</source>
          , U. Fayyad,
          <string-name>
            <given-names>G.</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          , and R. Uthurusamy, Eds. AAAI/MIT Press, Menlo Park, CA, pp.
          <fpage>83</fpage>
          -
          <lpage>113</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <source>An Introduction to Data Warehousing</source>
          , System Services Corporation
          , Chicago, Illinois,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>W.H.</given-names>
            <surname>Inmon</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.D.</given-names>
            <surname>Hackathorn</surname>
          </string-name>
          ,
          <source>Using the Data Warehouse</source>
          , John Wiley &amp; Sons, New York,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>G. H.</given-names>
            <surname>John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Pfleger</surname>
          </string-name>
          ,
          <article-title>Irrelevant Features and the Subset Selection Problem</article-title>
          ,
          <source>Proc. of the 11th Int'l Conf. on Machine Learning</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>129</lpage>
          , Morgan Kaufmann,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Kira</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Rendell</surname>
          </string-name>
          ,
          <article-title>The Feature Selection Problem: Traditional Methods and a New Algorithm</article-title>
          ,
          <source>Proc. of AAAI'92</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>H.F.</given-names>
            <surname>Korth</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Silberschatz</surname>
          </string-name>
          ,
          <source>Database System Concepts</source>
          ,
          McGraw-Hill, Inc., New York,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Motoda</surname>
          </string-name>
          ,
          <source>Feature Selection for Knowledge Discovery and Data Mining</source>
          , Kluwer, Boston,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Maimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kandel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Last</surname>
          </string-name>
          ,
          <article-title>Information-Theoretic Fuzzy Approach to Knowledge Discovery in Databases</article-title>
          , in
          <source>Advances in Soft Computing - Engineering Design and Manufacturing</source>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Furuhashi</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.K.</given-names>
            <surname>Chawdhry</surname>
          </string-name>
          , Eds. Springer-Verlag, London, pp.
          <fpage>315</fpage>
          -
          <lpage>326</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Schouten</surname>
          </string-name>
          ,
          <article-title>Analysis and Design of Data Warehouses</article-title>
          ,
          <source>Proc. International Workshop on Design and Management of Data Warehouses (DMDW'99)</source>
          , Heidelberg, Germany,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>