Automated Dimensionality Reduction of Data Warehouses

Mark Last
Department of Computer Science and Engineering, University of South Florida
4202 E. Fowler Avenue, ENB 118, Tampa, FL 33620 USA
mlast@csee.usf.edu

Oded Maimon
Department of Industrial Engineering, Tel-Aviv University
Tel Aviv 69978, Israel
maimon@eng.tau.ac.il
Abstract

A data warehouse is designed to consolidate and maintain all attributes that are relevant for the analysis processes. Due to the rapid increase in the size of modern operational systems, it becomes neither practical nor necessary to load and maintain every operational attribute in the data warehouse. This paper presents a novel methodology for automated selection of the most relevant independent attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases. Attributes are selected by a stepwise forward procedure aimed at minimizing the uncertainty in the values of key performance indicators (KPI's). Each selected attribute is assigned a score expressing its degree of relevance. Using the method does not require any prior expertise in the domain of the data, and it can be applied equally to nominal and ordinal attributes. An attribute will be included in a data warehouse schema if it is found to be relevant to at least one KPI. We demonstrate the applicability of the method by reducing the dimensionality of a direct marketing database.

1   Introduction

A data warehouse is defined by Inmon (1994) as a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making." Each organization has its own key performance indicators (KPI's), which are used by management to monitor the organization's performance. Thus, a manufacturing company may measure its performance by throughput and cost, a KPI of a service company is the mean time to handle a service call, etc. When designing a data warehouse, we assume that these KPI's depend on some non-key attributes (called classifying attributes) in data warehouse relations (Schouten, 1999). The form and the strength of these dependencies are often the subject of data analysis on a data warehouse.

Since operational systems and data warehouses are built for different purposes (see Inmon, 1994), some attributes that are essential to the operational system may be completely irrelevant to the performance measures of a company and thus excluded from its data warehouse. This fact is emphasized by several authors (like Gupta, 1997), but usually their assumption is that the data warehouse designers are knowledgeable enough to choose the right set of attributes. Though this assumption may be correct to a certain extent in most data warehousing projects, the process of reducing the warehouse dimensionality can be supported by an automated feature selection procedure that can quickly examine numerous potential dependencies, leading to automatic elimination of all irrelevant and redundant attributes.

There is another advantage in reducing the warehouse dimensionality. According to Elder and Pregibon (1996), a large number of attributes constitutes a serious obstacle to the efficiency of most data mining algorithms, which may be applied to the detail data in a data warehouse. Such popular methods as k-nearest neighbors, decision trees, and neural networks do not scale well in the presence of numerous features. Moreover, some algorithms may be confused by irrelevant or noisy attributes and construct poor prediction models. A successful choice of features provided to a data mining tool can increase its accuracy, save computation time, and simplify its results.

John et al. (1994) distinguish between two models of selecting a "good" set of features under some objective function. The feature filter model assumes filtering the features before applying a data mining algorithm, while the wrapper model uses the data mining algorithm itself to evaluate the features. The possible search strategies in the space of feature subsets include backward elimination and forward selection. The performance criterion of the wrapper model in (John et al., 1994) is the prediction accuracy of a data mining algorithm, estimated by n-fold cross-validation.
A new book on feature selection by Liu and Motoda (1998) suggests a unified model of the feature selection process. Their model includes four parts: feature generation, feature evaluation, stopping criteria, and testing. In addition to the "classic" evaluation measures (accuracy, information, distance, and dependence) that can be used for removing irrelevant features, they mention important consistency measures (e.g., the inconsistency rate), required to find a minimum set of relevant features. By decreasing the inconsistency rate of the data, both irrelevant and redundant features are removed. However, as indicated by Liu and Motoda (1998), consistency measures are only suitable for selecting discrete features.

An enhanced greedy algorithm, based on the wrapper model, is presented by Caruana and Freitag (1994). Again, the metric used is the generalization performance of the learning algorithm (its accuracy over the validation data set), which increases the computation time of the entire process.

An information-theoretic method for selecting relevant features is presented by Almuallim and Dietterich (1992). In their Mutual-Information-Greedy (MIG) algorithm, defined for Boolean noise-free features, a feature is selected if it leads to the minimum conditional entropy of the classification attribute. Since the data is assumed to be noise-free, no significance testing is required (any non-zero entropy is significant). These assumptions leave the MIG algorithm at quite a distance from most practical problems of reducing data warehouse dimensionality.

Kira and Rendell (1992) have suggested an efficient feature selection algorithm, called Relief, which evaluates each attribute by its ability to distinguish among instances that are near each other. Their selection criterion, the feature relevance, is applicable to numeric and nominal attributes. The greatest limitation of Relief is its inability to identify redundant features within a set of relevant features. Consequently, the set of features selected by Relief may not be optimal.

To sum up this section, the backward elimination strategy is very inefficient for reducing the dimensionality of real-world data warehouses, which may have hundreds or thousands of original attributes. On the other hand, the forward selection wrapper methods are highly expensive in terms of computational effort. The filter algorithms are computationally cheaper, but they usually fail to remove all redundant features. In the next section, we describe the information-theoretic method of feature selection in data warehouses, based on the knowledge discovery procedure introduced by us in (Maimon, Kandel, and Last, 1999). The procedure integrates feature selection with a highly scalable data mining algorithm, leading to elimination of both irrelevant and redundant features. The method description is followed by a case study of feature selection in a direct marketing database. We conclude the paper by discussing potential enhancements of the proposed approach.

2   Information-Theoretic Method of Feature Selection

The method selects features by constructing information-theoretic connectionist networks, which represent interactions between the classifying features and each dependent attribute (a key performance indicator). The method is based on the extended relational data model, described in sub-section 2.1. In sub-section 2.2, we present the main steps of the feature selection procedure. Extraction of functional dependencies from the constructed networks is covered by sub-section 2.3. Finally, in sub-section 2.4, we evaluate the computational complexity of the algorithm.

2.1   Extended Relational Data Model

We use the following standard notation of the relational data model (see Korth and Silberschatz, 1991); a schematic code sketch of this notation is given after the list:

1) R - a relation schema including N attributes (N ≥ 2). A relation is a part of the operational database.

2) Ai - attribute No. i. R = (A1, ..., AN).

3) Di - the domain of an attribute Ai. We assume that each domain is a finite set of Mi discrete values, ∀i: Mi ≥ 2. For numeric attributes, the domain is a set of adjacent intervals. The discretization of classifying continuous attributes is performed automatically in the process of feature selection (see the next sub-section). Continuous KPI's may be discretized manually into pre-defined intervals.

4) Vij - value No. j of domain Di. Consequently, Di = (Vi1, ..., ViMi). For discretized numeric attributes, each value represents an interval between two continuous values.

5) r - a relation instance (table) of the relation schema R.

6) n - the number of tuples (records) in a relation r (n ≥ 2).

7) tk[Ai] - the value of attribute No. i in tuple (record) No. k. ∀k,i: tk[Ai] ∈ Di.
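The sketch below shows one possible in-memory representation of the notation above. It is only an illustration: the class and field names are ours, not part of the paper, and the sample tuples borrow attribute names from the case study in Section 3.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Attribute:
    """A_i with its discrete domain D_i = (V_i1, ..., V_iMi); for a discretized
    numeric attribute, each value stands for an interval."""
    name: str
    domain: List[str]  # M_i >= 2 values

@dataclass
class RelationSchema:
    """R = (A_1, ..., A_N), N >= 2: a relation of the operational database."""
    attributes: List[Attribute]

# A relation instance r is a list of n >= 2 tuples t_k, where t_k[A_i] is in D_i:
r: List[Dict[str, str]] = [
    {"LASTGIFT": "$19 - $20.25", "TARGET_D": "$20 - $29"},  # t_1
    {"LASTGIFT": "$1 - $5",      "TARGET_D": "$0 - $9"},    # t_2
]
```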


To find the set of classifying attributes in a relation of the operational database, we make the following partition of the relation schema:

1) O - a subset of target (dependent) attributes (O ⊂ R, |O| ≥ 1). These attributes represent the key performance indicators (KPI's) of an organization. Such attributes are referred to as facts in the data warehouse logical model (Inmon, 1994). Each connectionist network is aimed at predicting the values of a KPI, based on the values of input attributes (see below).

2) C - a subset of candidate input attributes (C ⊂ R, |C| ≥ 1). This is a subset of attributes (features) which can be related to the target attributes.

3) Ii - a subset of input attributes (classifying features) selected by the algorithm as related to target attribute i (∀i: Ii ⊂ C). These attributes are going to be loaded as dimensions into the data warehouse.

Assumptions:

1) ∀i: Ii ∩ O = ∅ (an attribute cannot be both an input and a target).

2) ∀i: Ii ∪ O ⊆ R (some attributes may be neither input nor target). For example, the attributes used as primary keys in the operational database (like a social security number) may be completely useless for a data warehouse, which has primary keys of its own. Sometimes (e.g., in health care), the identifying attributes are removed from the data warehouse to preserve the privacy of the historic records.

2.2   Feature Selection Procedure

The main steps of the feature selection procedure are given below.

Step 1 - Obtain the relation schema (name, type, and domain size of each attribute) and the schema partition into a subset of candidate input attributes and a subset of target attributes (see the extended relational model in sub-section 2.1 above).

Step 2 - Read the relation tuples (records) from the operational database. Tuples with illegal or missing target values are ignored by the algorithm.

Step 2.1 - Encode missing values of candidate input attributes in a pre-determined form.

Step 2.2 - Discretize each continuous attribute by maximizing its mutual information with the target attribute. Mutual information (see Cover, 1991) is defined as the decrease in the uncertainty of one attribute, given the value of another attribute.

Step 3 - Enter the minimum significance level for splitting a network node (default = 0.1%). This significance level is used by the likelihood-ratio test (see Step 4.2.1.2.3 below).

Step 4 - Repeat for every target attribute (KPI) i:

Step 4.1 - Initialize the information-theoretic network (one hidden layer including the root node associated with all tuples, no input attributes, and one target layer for the values of the target attribute). A new network is built for every target attribute. An example of the initial network structure for a three-valued target attribute is shown in Figure 1.

[Figure 1: Information-Theoretic Connectionist Network - Initial Structure. The root node (layer No. 0) is linked by connection weights to the three nodes of the target layer.]

Step 4.2 - Repeat for the maximum number of hidden layers (default = the number of candidate input attributes).

Step 4.2.1 - Repeat for every candidate input attribute i' which is not yet an input attribute:

Step 4.2.1.1 - Initialize to zero the degrees of freedom, the estimated conditional mutual information, and the likelihood-ratio statistic of the candidate input attribute and the target attribute, given the final hidden layer of nodes. Conditional mutual information (Cover, 1991) measures the net decrease in the entropy (uncertainty) of the dependent attribute due to adding information about each new classifying attribute. If the target is completely independent of an attribute, the conditional mutual information, given that attribute, is zero.

Step 4.2.1.2 - Repeat for every node of the final hidden layer:

Step 4.2.1.2.1 - Calculate the estimated conditional mutual information of the candidate input attribute and the target attribute, given the node.

Step 4.2.1.2.2 - Calculate the likelihood-ratio statistic of the candidate input attribute and the target attribute, given the node.

Step 4.2.1.2.3 - If the likelihood-ratio statistic is significant, mark the node as "split" and increment the conditional mutual information of the candidate input attribute and the target attribute, given the final hidden layer of nodes; else mark the node as "unsplit".
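To make Steps 4.2.1.2.1 - 4.2.1.2.3 concrete before the procedure continues with Step 4.2.1.2.4 below, here is a minimal Python sketch of the per-node computation. The function name is hypothetical, and the G² form of the likelihood-ratio statistic with (|X|-1)(|Y|-1) degrees of freedom is our reading of the step descriptions, not code from the paper.

```python
import math
from collections import Counter
from scipy.stats import chi2

def node_test(pairs, alpha=0.001):
    """pairs: (candidate value, target value) tuples recorded at one hidden node.
    Returns the estimated conditional mutual information (in bits) and whether
    the node should be marked as "split" at significance level alpha."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)   # candidate-attribute counts at the node
    py = Counter(y for _, y in pairs)   # target-attribute counts at the node
    # Step 4.2.1.2.1: conditional MI of candidate and target, given the node.
    mi = sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
             for (x, y), c in joint.items())
    # Step 4.2.1.2.2: likelihood-ratio statistic; with MI in bits,
    # G^2 = 2 * n * ln(2) * MI (our formulation of the test).
    g2 = 2.0 * n * math.log(2.0) * mi
    dof = max((len(px) - 1) * (len(py) - 1), 1)
    # Step 4.2.1.2.3: a significant statistic means the node is "split".
    return mi, g2 > chi2.ppf(1.0 - alpha, dof)
```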



Step 4.2.1.2.4 - Go to the next node.

Step 4.2.1.3 - Go to the next candidate input attribute.

Step 4.2.2 - Find the candidate input attribute maximizing the estimated conditional mutual information. According to information theory, this attribute is the best predictor of the target, given the values of the other input attributes.

Step 4.2.3 - If the maximum estimated conditional mutual information is greater than zero:

• Make the best candidate attribute an input attribute.
• Define a new layer of hidden nodes for the Cartesian product of the split hidden nodes of the previous layer and the values of the best candidate attribute.
• Record the tuples associated with every node of the new layer (a new node is defined only if there is at least one tuple associated with it).

Else stop the search and output the subset Ii of input attributes (classifying features) associated with target attribute i.

Step 4.3 - Go to the next target attribute.

Step 5 - Define the set I of selected attributes (dimensions) as the union of the sets of input attributes with respect to every target attribute:

$$I = \bigcup_{i=1}^{|O|} I_i$$

Step 6 - End.
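The loop structure of Steps 4.1 - 4.2.3 can be condensed into a short driver function. The sketch below is our own structuring of the search, not the authors' implementation; it reuses node_test() from the previous sketch and weights each node's conditional MI by the node's share of the tuples.

```python
def select_features(tuples, candidates, target, alpha=0.001):
    """Forward selection of the input attributes I_i for one target attribute (KPI)."""
    selected = []       # I_i: the input attributes chosen so far
    layer = [tuples]    # Step 4.1: one hidden layer holding the root node
    for _ in range(len(candidates)):                       # Step 4.2
        best, best_mi, best_nodes = None, 0.0, None
        for cand in candidates:                            # Step 4.2.1
            if cand in selected:
                continue
            mi_total, split_nodes = 0.0, []
            for node in layer:                             # Step 4.2.1.2
                mi, split = node_test([(t[cand], t[target]) for t in node], alpha)
                if split:  # only significant nodes contribute (Step 4.2.1.2.3)
                    mi_total += mi * len(node) / len(tuples)
                    split_nodes.append(node)
            if mi_total > best_mi:                         # Step 4.2.2
                best, best_mi, best_nodes = cand, mi_total, split_nodes
        if best is None:       # Step 4.2.3, "else" branch: stop and output I_i
            break
        selected.append(best)
        # New layer: Cartesian product of the split nodes and the values of
        # `best`; a node exists only if at least one tuple falls into it.
        layer = [group for node in best_nodes
                 for group in partition_by_value(node, best)]
    return selected

def partition_by_value(node, attr):
    groups = {}
    for t in node:
        groups.setdefault(t[attr], []).append(t)
    return list(groups.values())
```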
In Figure 2, the structure of a two-layered network (based on two selected input attributes) is shown. The first input attribute has three values, represented by nodes no. 1, 2, and 3 in the first layer, but only nodes no. 1 and 3 are split, due to the statistical significance testing in Step 4.2.1.2 above. The second layer has four nodes, standing for the combinations of the two values of the second input attribute with the two split nodes of the first layer. As in Figure 1, the target attribute has three values, represented by three nodes in the target layer.

Details of the network construction procedure for a single target attribute are provided in (Maimon et al., 1999). The application of the algorithm to the design of data warehouses is presented here for the first time.

[Figure 2: Information-Theoretic Connectionist Network - Two-Layered Structure. The root node (layer No. 0) branches into first-layer nodes 1-3; the split nodes 1 and 3 branch into second-layer nodes (1,1), (1,2), (3,1), and (3,2), which are linked by connection weights to the three target-layer nodes.]

2.3   Extracting Functional Dependencies

Each connection between a terminal (unsplit or final-layer) node and a target node in the information-theoretic network represents an association rule between a conjunction of input attribute values and a target value. Due to the inherent noisiness of real-world data, this rule can be considered a weak (probabilistic) functional dependency, as opposed to the deterministic functional dependencies of primary and alternate keys (Schouten, 1999). In the relational model (see Korth and Silberschatz, 1991), a functional dependency between sets of attributes X and Y means that the values of the Y component of a tuple are determined by the values of the X component. In contrast, an association rule between X and Y has a more limited meaning: given values of X, we can estimate the probability distribution of Y.

The information-theoretic weight of each rule is given by:

$$w_{ij}^{z} = P(V_{ij}; z) \cdot \log \frac{P(V_{ij} / z)}{P(V_{ij})}$$

where

P(Vij; z) - the estimated joint probability of the target value Vij and the node z;
P(Vij / z) - the estimated conditional (a posteriori) probability of the target value Vij, given the node z;
P(Vij) - the estimated unconditional (a priori) probability of the target value Vij.
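A sketch of how these weights could be estimated from frequency counts is given below. The function name is hypothetical; the logarithm is taken to base 2 because the paper measures information in bits.

```python
import math
from collections import Counter

def rule_weights(node_tuples, all_tuples, target):
    """Information-theoretic weights w_ij^z of the rules linking one terminal
    node z to each target value V_ij, estimated by frequency counts."""
    n = len(all_tuples)
    prior = Counter(t[target] for t in all_tuples)   # n * P(V_ij)
    post = Counter(t[target] for t in node_tuples)   # |z| * P(V_ij / z)
    weights = {}
    for v, c in post.items():   # target values absent at z get weight 0
        p_joint = c / n                              # P(V_ij ; z)
        p_cond = c / len(node_tuples)                # P(V_ij / z)
        weights[v] = p_joint * math.log2(p_cond * n / prior[v])
    return weights  # positive: "if node, then value"; negative: "then not value"
```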



The connection weights express the mutual information between hidden and target nodes. A connection weight is positive if the conditional probability of a target attribute value, given the node, is higher than its unconditional probability, and negative otherwise. A weight close to zero means that the target attribute value is almost independent of the node value. Consequently, each rule having a positive connection weight can be interpreted as if node, then target value. Accordingly, a negative weight refers to a rule of the form if node, then not target value.

The most informative rules can be found by ranking the information-theoretic connection weights (w_ij^z) in decreasing order. Both the rules having the highest positive weights and those having the lowest negative weights are of potential interest to a user. The sum of the connection weights at all unsplit and final-layer nodes is equal to the estimated mutual information between the set of input attributes and the target attribute (see the definition of mutual information in Cover, 1991). According to the well-known Pareto principle, a small number of informative rules is expected to explain a major part of the total mutual information, which agrees with the experimental results presented in the next section.

2.4   Computational Complexity

To calculate the computational complexity of the feature selection procedure, we use the following notation:

n - the total number of tuples in a training data set
|C| - the total number of candidate input attributes
p - the portion of significant input attributes selected by the search procedure
m - the number of hidden layers (input attributes), m ≤ |C|
|O| - the total number of target attributes
MC - the maximum domain size of a candidate input attribute
MT - the maximum domain size of a target attribute

The computational "bottleneck" of the algorithm is calculating the estimated conditional mutual information MI(Ai'; Ai / z) of the candidate input attribute Ai' and the target attribute Ai, given a hidden node z. Since each node of the m-th hidden layer represents a conjunction of the values of m input attributes, the total number of nodes at layer No. m is apparently bounded by (MC)^m. However, we restrict defining a new node (see Step 4.2.3 above) by the requirement that there is at least one tuple associated with it. Thus, the total number of nodes at any hidden layer cannot exceed the total number of tuples (n). In most cases the number of nodes will be much smaller than n, due to tuples having identical values and the statistical significance requirement of the likelihood-ratio test.

The calculation of MI(Ai'; Ai / z) is performed at each hidden layer of every target attribute for all candidate input attributes at that layer. The summation members of MI(Ai'; Ai / z) refer to the Cartesian product of the values of a candidate input attribute and a target attribute. The number of hidden layers is equal to p|C|. This implies that the total number of calculations is bounded by:

$$|O| \cdot \sum_{m=0}^{p|C|} n \cdot M_T \cdot M_C \cdot (|C| - m) \;\le\; |O| \cdot n \cdot M_T \cdot M_C \cdot \frac{|C|^2 \cdot p \cdot (2 - p)}{2}$$

Thus, the feature selection algorithm can be applied to large-scale databases in time which is linear in the number of records and quadratic in the number of candidate input attributes. It is also directly proportional to the number of target attributes (key performance indicators).
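For intuition, the bound can be evaluated numerically. The following snippet is only an illustration; the parameter values are assumed for the example, not taken from the paper.

```python
def calc_bound(n_targets, n, m_t, m_c, n_cand, p):
    """|O| * n * M_T * M_C * |C|^2 * p * (2 - p) / 2: the upper bound derived above,
    with parameter names mirroring the notation of sub-section 2.4."""
    return n_targets * n * m_t * m_c * n_cand ** 2 * p * (2.0 - p) / 2.0

# Assumed illustrative values: 2 KPI's, 100,000 tuples, a target discretized into
# 20 intervals, candidate domains of up to 10 values, 200 candidates, 5% selected.
print(f"{calc_bound(2, 100_000, 20, 10, 200, 0.05):.3g}")  # ~7.8e+10 summation members
```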



3   Case Study: Direct Marketing Database

3.1   Background and Objectives

The original source of the data is the Paralyzed Veterans of America (PVA), a non-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an in-house database of over 13 million donors, PVA is also one of the largest direct-mail fundraisers in the US. The data set presents the results of one of PVA's recent fund-raising appeals. This mailing was sent to 3.5 million PVA donors who were on the PVA database as of June 1997. Everyone included in this mailing had donated to PVA at least once.

One group that is of particular interest to PVA is "lapsed" donors: individuals who donated to PVA 13 to 24 months ago. They represent an important group to PVA, since the longer someone goes without donating, the less likely they will be to give again. Therefore, recapture of these former donors is a critical aspect of PVA's fund-raising efforts. The data set to be analyzed includes all lapsed donors who received the mailing (responders and non-responders). The total dollar amount of the gift is given for each responder. The attributes extracted from the database can be used for understanding the behavior of both the most and the least profitable individuals. This important insight may lead to more successful promotion campaigns in the future.

The data set of lapsed donors, extracted from the PVA database, is publicly available on the UCI KDD Archive [http://kdd.ics.uci.edu] as the KDD Cup 1998 Data. It was originally used for the Second International Knowledge Discovery and Data Mining Tools Competition in 1998.

3.2   Database Characteristics

The special characteristics of the Direct Marketing database include the following:

• Dimensionality. The data set to be used contains 95,412 tuples, including 5.1% (4,843 cases) of responders to the mailing. Each tuple has 481 attributes, namely one key, 478 input, and 2 target attributes.

• Input Attributes. There are 76 character (nominal) and 402 numeric (continuous) attributes. These attributes include donor demographic data (as collected by PVA and third-party data sources), the promotion / donation history, and the characteristics of the donor's neighborhood, as collected from the 1990 US Census. For privacy reasons, no identifying data (like the donor's name, address, etc.) has been included in the data set.

• Target Attributes. There are two target attributes in each tuple: the binary indicator for response (the attribute TARGET_B) and the dollar amount of the donation (the attribute TARGET_D). Naturally, the donation amount is zero for all non-responders. Both can be considered key performance indicators (KPI's) for the fund-raising organization.

• Data Quality. As indicated in the database documentation, some of the fields in the analysis file may contain data entry and/or formatting errors.

• Missing Values. Most input attributes contain a certain amount of missing values, which should be inferred from known values at the pre-processing stage.

3.3   Data Pre-processing

The pre-processing tasks included the following (a code sketch of some of these steps is given after the list):

• Attribute decoding. The original attributes presenting donor promotion history and status contain codes, where each byte has a different meaning (e.g., recency, frequency, and amount of donation). These codes have been decoded by splitting each encoded attribute into several separate attributes. The decoding operation has increased the total number of attributes from 481 to 518.

• Missing values. Missing values have been replaced with the mode (the most frequent value of an attribute). As recommended by the data documentation, attributes containing 99.5% or more missing values have been omitted from the analysis.

• Rare values. All values of nominal attributes that occur fewer than 100 times in the data set have been encoded as "Other".

• Transformation by division. To decrease the number of distinct values of large-scale continuous attributes, the values of some attributes have been divided by a constant factor (10, 100, etc.) and then rounded off to the nearest integer.

• Discretization of the target attribute. The continuous target attribute TARGET_D has been discretized to equal-width intervals of $10 each. The total number of discretization intervals is 20, covering the attribute range between $0 and $200.

• Discretization of input attributes. Numeric attributes have been discretized using a significance level of 99%. For 191 attributes (out of 404), no statistically significant partition has been found, and these attributes have been omitted from the further stages of the analysis. The remaining 213 attributes have been left as candidate input attributes.
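A rough pandas sketch of three of these steps (mode imputation, rare-value grouping, and target discretization) is shown below. The input file name is assumed to be the one under which the KDD Cup 1998 learning set is distributed; TARGET_D is a real column, while TARGET_D_BIN is our own name. The decoding and division transformations are omitted.

```python
import pandas as pd

df = pd.read_csv("cup98LRN.txt", low_memory=False)   # assumed KDD Cup 1998 file name

# Missing values: replace with the mode (most frequent value) of each attribute.
for col in df.columns:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])

# Rare values: nominal values occurring fewer than 100 times become "Other".
for col in df.select_dtypes(include="object").columns:
    counts = df[col].value_counts()
    rare = counts[counts < 100].index
    df[col] = df[col].where(~df[col].isin(rare), "Other")

# Target discretization: equal-width $10 intervals covering $0 - $200.
df["TARGET_D_BIN"] = pd.cut(df["TARGET_D"], bins=range(0, 201, 10), right=False)
```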
3.4   Feature Selection

The process of feature selection in the Direct Marketing database requires building separate networks for the two KPI's: TARGET_B (the binary indicator for response) and TARGET_D (the dollar amount of the donation). However, when we applied the dimensionality reduction procedure of sub-section 2.2 above to the first target attribute (TARGET_B), no significant input attributes were found. In other words, no candidate input attribute present in the database is relevant to the fact of somebody responding to the mailing. Thus, we proceed with the results of the information-theoretic network built for the second target attribute (TARGET_D) only.




The dimensionality reduction procedure of sub-section 2.2 above has been applied to all tuples of responders to the fund-raising appeal (4,843). The algorithm has selected five significant input attributes, which are about 1% of the original candidate input attributes (478). Thus, the resulting information-theoretic network includes only five hidden layers. The selected attributes and their information-theoretic scores are presented in Table 1 below.

Table 1 shows three information-theoretic measures of association between the input attributes and the target attribute: Mutual Information, Conditional Mutual Information, and Conditional Entropy. All these parameters are based on the notion of entropy (see Cover, 1991), which represents the uncertainty of a random variable. The entropy is measured in bits. Information on input attributes associated with the target can decrease the uncertainty, and hence the resulting entropy, of the target.

The column "Mutual Information" shows the cumulative association between the subset of input attributes selected up to a given iteration, inclusively, and the target attribute. The next column, "Conditional MI (Mutual Information)", shows the net decrease in the entropy of the target attribute TARGET_D due to adding each input attribute. The last column ("Conditional Entropy") is equal to the difference between the unconditional entropy of TARGET_D (1.916 bits) and the estimated mutual information.

As can be seen from the description column, the first four attributes selected represent the person's donation history, while the fifth significant input attribute characterizes the donor's neighborhood. The most significant attribute, LASTGIFT, alone contributes about 90% of the total mutual information. From the viewpoint of interestingness, the last attribute (TPE13) seems to be the most unexpected one. It is rather unlikely that an average human expert in donations and direct mailing would pick this attribute out of a list of more than 500 attributes.

If the PVA organization designs a data warehouse for supporting its fundraising activities, the results of the feature selection algorithm can cause a dramatic reduction in the amount of stored data. Given that an average attribute (field) requires four bytes of memory, one donor tuple (record) of 478 input attributes takes about 2K bytes. Since PVA has a donor database of about 13 million records, a dimensionality reduction of 99% means saving about 25 GB of computer storage at the time of data creation (1997). Perhaps PVA could be interested in storing additional information in its data warehouse, based on its business expertise, but still the above results suggest that the majority of the operational data is not needed for loading into the data warehouse: that data is either redundant or completely irrelevant to PVA's fundraising performance.
Table 1: Direct marketing database – dimensionality reduction procedure

Iteration | Attribute Number | Attribute Name | Attribute Description                      | Mutual Information | Conditional MI | Conditional Entropy
----------|------------------|----------------|--------------------------------------------|--------------------|----------------|--------------------
1         | 318              | LASTGIFT       | Dollar amount of most recent gift          | 0.6976             | 0.6976         | 1.2188
2         | 316              | MAXRAMNT       | Dollar amount of largest gift to date      | 0.7599             | 0.0624         | 1.1564
3         | 314              | MINRAMNT       | Dollar amount of smallest gift to date     | 0.7671             | 0.0072         | 1.1492
4         | 319              | LASTDATE       | Date associated with the most recent gift  | 0.7711             | 0.0040         | 1.1452
5         | 233              | TPE13          | Percent Traveling 15 - 59 Minutes to Work  | 0.7737             | 0.0025         | 1.1427
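The last two columns of Table 1 are linked by simple arithmetic: Conditional Entropy equals the unconditional entropy of TARGET_D minus the cumulative Mutual Information. The check below reproduces the column up to rounding (the 1.916-bit figure quoted in the text is itself rounded).

```python
H0 = 1.916   # unconditional entropy of TARGET_D, in bits (rounded)
for mi in (0.6976, 0.7599, 0.7671, 0.7711, 0.7737):
    print(round(H0 - mi, 4))   # 1.2184, 1.1561, 1.1489, 1.1449, 1.1423:
                               # within ~0.0004 of Table 1, due to rounding of H0
```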




3.5   Functional Dependencies

The connections between the input and target nodes in the information-theoretic network constructed in the previous sub-section have been used to extract disjunctive association rules between the significant input attributes and the target attribute (TARGET_D). A total of 935 positive and 85 negative rules have been extracted. The rules have been scored by the information-theoretic weights of their connections (see sub-section 2.3 above).

In Table 2 below, we present only the rules having the highest connection weights in the network. Most rules in this table indicate that there is a direct relationship between the donation history and the actual amount of donation (TARGET_D). Thus, rule no. 2 says that if the last gift is between $20.5 and $28, then the actual donation is expected to be between $20 and $29 (almost the same range). A similar interpretation can be given to rule no. 3: those who were poor donors the last time are expected to preserve their behavior. Rule no. 1 is a slight but interesting exception: it says that those who donated about $20 the last time are expected to increase their donation by as much as 50%.


Table 2: Direct marketing database – highest positive connection weights

Rule | LASTGIFT      | MAXRAMNT      | MINRAMNT      | TARGET_D  | Weight | Cum. Weight | Cum. Percent
-----|---------------|---------------|---------------|-----------|--------|-------------|-------------
1    | $19 - $20.25  |               |               | $20 - $29 | 0.1164 | 0.1164      | 11.9%
2    | $20.5 - $28   |               |               | $20 - $29 | 0.0859 | 0.2023      | 20.7%
3    | $1 - $5       |               |               | $0 - $9   | 0.0841 | 0.2864      | 29.3%
4    | $12 - $14     |               |               | $10 - $19 | 0.0596 | 0.3460      | 35.4%
5    | $1 - $5       | $7 - $9       |               | $0 - $9   | 0.0534 | 0.3994      | 40.9%
6    | $1 - $5       | $10 - $11     |               | $0 - $9   | 0.0402 | 0.4396      | 45.0%
7    | $6 - $7       | $7 - $9       |               | $0 - $9   | 0.0392 | 0.4788      | 49.0%
8    | $10 - $11     | $10 - $11     |               | $10 - $19 | 0.0377 | 0.5165      | 52.9%
9    | $20.5 - $28   |               |               | $30 - $39 | 0.0307 | 0.5472      | 56.0%
10   | $15 - $15.5   | $15 - $15.5   | $4.68 - $8.99 | $10 - $19 | 0.0293 | 0.5765      | 59.0%
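The "Cum. Weight" column of Table 2 is simply the running sum of the individual rule weights, as the check below shows. "Cum. Percent" divides this sum by the total weight of all extracted rules, a figure of roughly 0.98 bits that we infer from the table's percentages rather than one stated in the paper.

```python
weights = [0.1164, 0.0859, 0.0841, 0.0596, 0.0534,
           0.0402, 0.0392, 0.0377, 0.0307, 0.0293]
cum = 0.0
for rule_no, w in enumerate(weights, start=1):
    cum += w
    print(rule_no, round(cum, 4))   # 0.1164, 0.2023, 0.2864, ..., 0.5765
```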




4   Conclusions

The paper presents a method for automated selection of attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases (Maimon et al., 1999). It does not require any prior knowledge about the business domain, can eliminate both irrelevant and redundant features, and is applicable to any type of data (nominal, numeric, etc.). While no automated procedure can fully replace a human expert, the decisions of data warehouse designers can certainly be supported by the feature selection system presented here. As shown by the direct marketing example, integrating automated feature selection in the design process can lead to a significant dimensionality reduction of a data warehouse.

Developing automated approaches to the design of data warehouses requires further investigation of several issues. For example, automated methods need to be developed for determining the set of transformations to be applied to the original attributes. A cost associated with each attribute (e.g., the cost of third-party data) can be integrated in the feature selection procedure. The data model (see sub-section 2.1 above) can be further extended to include candidate input attributes from multiple relations (via foreign keys). Another important problem is detecting dynamic changes in the weak functional dependencies and updating the data warehouse structure accordingly.

References

H. Almuallim and T.G. Dietterich, Efficient Algorithms for Identifying Relevant Features, Proc. of the 9th Canadian Conf. on AI, pages 38-45, Morgan Kaufmann, 1992.

S.D. Bay, The UCI KDD Archive [http://kdd.ics.uci.edu], Irvine, CA: University of California, Department of Information and Computer Science, 1999.

R. Caruana and D. Freitag, Greedy Attribute Selection, Proc. of the 11th Int'l Conf. on Machine Learning, pages 28-36, Morgan Kaufmann, 1994.

T.M. Cover, Elements of Information Theory, Wiley, New York, 1991.

J.F. Elder IV and D. Pregibon, A Statistical Perspective on Knowledge Discovery in Databases, in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, Menlo Park, CA, pp. 83-113, 1996.

V.R. Gupta, An Introduction to Data Warehousing, System Services Corporation, Chicago, Illinois, 1997.

W.H. Inmon and R.D. Hackathorn, Using the Data Warehouse, John Wiley & Sons, New York, 1994.

G.H. John, R. Kohavi, and K. Pfleger, Irrelevant Features and the Subset Selection Problem, Proc. of the 11th Int'l Conf. on Machine Learning, pages 121-129, Morgan Kaufmann, 1994.

K. Kira and L.A. Rendell, The Feature Selection Problem: Traditional Methods and a New Algorithm, Proc. of AAAI'92, pages 129-134, 1992.

H.F. Korth and A. Silberschatz, Database System Concepts, McGraw-Hill, New York, 1991.

H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer, Boston, 1998.

O. Maimon, A. Kandel, and M. Last, Information-Theoretic Fuzzy Approach to Knowledge Discovery in Databases, in Advances in Soft Computing - Engineering Design and Manufacturing, R. Roy, T. Furuhashi, and P.K. Chawdhry, Eds., Springer-Verlag, London, pp. 315-326, 1999.

H. Schouten, Analysis and Design of Data Warehouses, Proc. of the International Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, 1999.



