Automated Dimensionality Reduction of Data Warehouses

Mark Last
Department of Computer Science and Engineering, University of South Florida
4202 E. Fowler Avenue, ENB 118, Tampa, FL 33620, USA
mlast@csee.usf.edu

Oded Maimon
Department of Industrial Engineering, Tel-Aviv University
Tel Aviv 69978, Israel
maimon@eng.tau.ac.il

The copyright of this paper belongs to the paper's authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage. Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'2000), Stockholm, Sweden, June 5-6, 2000 (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.), http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/

Abstract

A data warehouse is designed to consolidate and maintain all attributes that are relevant for the analysis processes. Due to the rapid increase in the size of modern operational systems, it becomes neither practical nor necessary to load and maintain every operational attribute in the data warehouse. This paper presents a novel methodology for automated selection of the most relevant independent attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases. Attributes are selected by a stepwise forward procedure aimed at minimizing the uncertainty in the values of key performance indicators (KPI's). Each selected attribute is assigned a score expressing its degree of relevance. Using the method does not require any prior expertise in the domain of the data, and it can be equally applied to nominal and ordinal attributes. An attribute will be included in a data warehouse schema if it is found relevant to at least one KPI. We demonstrate the applicability of the method by reducing the dimensionality of a direct marketing database.

1 Introduction

A data warehouse is defined by Inmon (1994) as a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making." Each organization has its own key performance indicators (KPI's), which are used by management to monitor the organization's performance. Thus, a manufacturing company may measure its performance by throughput and cost, a KPI of a service company is the mean time to handle a service call, etc. When designing a data warehouse, we assume that these KPI's depend on some non-key attributes (called classifying attributes) in data warehouse relations (Schouten, 1999). The form and the strength of these dependencies are often the subject of data analysis on a data warehouse.

Since the operational systems and the data warehouses are built for different purposes (see Inmon, 1994), some attributes that are essential to the operational system may be completely irrelevant to the performance measures of a company and thus excluded from its data warehouse. This fact is emphasized by several authors (like Gupta, 1997), but usually their assumption is that the data warehouse designers are knowledgeable enough to choose the right set of attributes. Though this assumption may be correct to a certain extent in most data warehousing projects, the process of reducing the warehouse dimensionality can be supported by an automated feature selection procedure that can quickly examine numerous potential dependencies, leading to automatic elimination of all irrelevant and redundant attributes.

There is another advantage in reducing the warehouse dimensionality. According to Elder and Pregibon (1996), a large number of attributes constitutes a serious obstacle to the efficiency of most data mining algorithms, which may be applied to the detail data in a data warehouse. Such popular methods as k-nearest neighbors, decision trees, and neural networks do not scale well in the presence of numerous features. Moreover, some algorithms may be confused by irrelevant or noisy attributes and construct poor prediction models. A successful choice of features provided to a data mining tool can increase its accuracy, save computation time, and simplify its results.

John et al. (1994) distinguish between two models of selecting a "good" set of features under some objective function. The feature filter model assumes filtering the features before applying a data mining algorithm, while the wrapper model uses the data mining algorithm itself to evaluate the features. The possible search strategies in the space of feature subsets include backward elimination and forward selection. The performance criterion of the wrapper model in (John et al., 1994) is the prediction accuracy of a data mining algorithm, estimated by n-fold cross validation.

A new book on feature selection by Liu and Motoda (1998) suggests a unified model of the feature selection process. Their model includes four parts: feature generation, feature evaluation, stopping criteria, and testing. In addition to the "classic" evaluation measures (accuracy, information, distance, and dependence) that can be used for removing irrelevant features, they mention important consistency measures (e.g., the inconsistency rate), required to find a minimum set of relevant features. By decreasing the inconsistency rate of the data, both irrelevant and redundant features are removed. However, as indicated by Liu and Motoda (1998), consistency measures are only suitable for selecting discrete features.

An enhanced greedy algorithm, based on the wrapper model, is presented by Caruana and Freitag (1994). Again, the metric used is the generalization performance of the learning algorithm (its accuracy over the validation data set), which increases the computation time of the entire process.
An information-theoretic method for selecting relevant features is presented by Almuallim and Dietterich (1992). In their Mutual-Information-Greedy (MIG) algorithm, defined for Boolean noise-free features, a feature is selected if it leads to the minimum conditional entropy of the classification attribute. Since the data is assumed to be noise-free, no significance testing is required (any non-zero entropy is significant). The above assumptions leave the MIG algorithm at quite a distance from most practical problems of reducing data warehouse dimensionality.

Kira and Rendell (1992) have suggested an efficient feature selection algorithm, called Relief, which evaluates each attribute by its ability to distinguish among instances that are near each other. Their selection criterion, the feature relevance, is applicable to numeric and nominal attributes. The greatest limitation of Relief is its inability to identify redundant features within a set of relevant features. Consequently, the set of features selected by Relief may not be optimal.

To sum up this section, the backward elimination strategy is very inefficient for reducing the dimensionality of real-world data warehouses, which may have hundreds and thousands of original attributes. On the other hand, the forward selection wrapper methods are highly expensive in terms of computational effort. The filter algorithms are computationally cheaper, but they usually fail to remove all redundant features. In the next section, we describe the information-theoretic method of feature selection in data warehouses, based on the knowledge discovery procedure introduced by us in (Maimon, Kandel, and Last, 1999). The procedure integrates feature selection with a highly scalable data mining algorithm, leading to elimination of both irrelevant and redundant features. The method description is followed by a case study of feature selection in a direct marketing database. We conclude the paper with a discussion of potential enhancements of the proposed approach.

2 Information-Theoretic Method of Feature Selection

The method selects features by constructing information-theoretic connectionist networks, which represent interactions between the classifying features and each dependent attribute (a key performance indicator). The method is based on the extended relational data model, described in sub-section 2.1. In sub-section 2.2, we present the main steps of the feature selection procedure. Extraction of functional dependencies from the constructed networks is covered by sub-section 2.3. Finally, in sub-section 2.4, we evaluate the computational complexity of the algorithm.

2.1 Extended Relational Data Model

We use the following standard notation of the relational data model (see Korth and Silberschatz, 1991):

1) R - a relation schema including N attributes (N ≥ 2). A relation is a part of the operational database.

2) Ai - an attribute No. i. R = (A1, ..., AN).

3) Di - the domain of an attribute Ai. We assume that each domain is a finite set of Mi discrete values, ∀i: Mi ≥ 2. For numeric attributes, the domain is a set of adjacent intervals. The discretization of classifying continuous attributes is performed automatically in the process of feature selection (see the next sub-section). Continuous KPI's may be discretized manually into pre-defined intervals.

4) Vij - a value No. j of domain Di. Consequently, Di = (Vi1, ..., ViMi). For discretized numeric attributes, each value represents an interval between two continuous values.

5) r - a relation instance (table) of the relation schema R.

6) n - the number of tuples (records) in a relation r (n ≥ 2).

7) tk[Ai] - the value of an attribute No. i in a tuple (record) No. k. ∀k,i: tk[Ai] ∈ Di.

To find the set of classifying attributes in a relation of the operational database, we make the following partition of the relation schema (a schematic representation in code is given after the assumptions below):

1) O - a subset of target (dependent) attributes (O ⊂ R, |O| ≥ 1). These attributes represent the key performance indicators (KPI's) of an organization. Such attributes are referred to as facts in the data warehouse logical model (Inmon, 1994). Each connectionist network is aimed at predicting the values of a KPI, based on the values of input attributes (see below).

2) C - a subset of candidate input attributes (C ⊂ R, |C| ≥ 1). This is a subset of attributes (features) which can be related to the target attributes.

3) Ii - a subset of input attributes (classifying features) selected by the algorithm as related to the target attribute i (∀i: Ii ⊂ C). These attributes are going to be loaded as dimensions into the data warehouse.

Assumptions:

1) ∀i: Ii ∩ O = ∅ (an attribute cannot be both an input and a target).

2) ∀i: Ii ∪ O ⊆ R (some attributes may be neither input nor target). For example, the attributes used as primary keys in the operational database (like a social security number) may be completely useless for a data warehouse, which has primary keys of its own. Sometimes (e.g., in health care) the identifying attributes are removed from the data warehouse to preserve the privacy of the historic records.
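The paper contains no code; purely as an illustration of the notation above, the following Python sketch shows one way the schema partition and its two assumptions could be represented. All names are hypothetical and not part of the original method.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SchemaPartition:
    """Illustrative container for the partition of a relation schema R into
    target attributes O, candidate inputs C, and the inputs I_i selected for
    each target attribute i (hypothetical names, not the authors' code)."""
    relation: Set[str]                    # R: all attribute names
    targets: Set[str]                     # O: key performance indicators
    candidates: Set[str]                  # C: candidate input attributes
    selected: Dict[str, Set[str]] = field(default_factory=dict)  # I_i per target

    def check_assumptions(self) -> None:
        for inputs in self.selected.values():
            # Assumption 1: an attribute cannot be both an input and a target.
            assert not (inputs & self.targets)
            # Assumption 2: inputs and targets together stay within R; attributes
            # outside this union (e.g. operational keys) are simply left out.
            assert (inputs | self.targets) <= self.relation
```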
example of the initial network structure for a three- 2) C - a subset of candidate input attributes (C ⊂ valued target attribute is shown in Figure 1. R, |C| ≥ 1). This is a subset of attributes (features), which can be related to the target attributes. 1 3) Ii - a subset of input attributes (classifying 0 features) selected by the algorithm as related to 2 the target attribute i (∀i: Ii ⊂ C). These 3 attributes are going to be loaded as dimensions into the data warehouse. Layer No. 0 Connection Target (the root node) Assumptions: Weights Layer 1) ∀i : I i ∩ O = ∅ (An attribute cannot be both an input and a target). Figure 1: Information-Theoretic Connectionist Network - Initial Structure 2) ∀i : I i ∪ O ⊆ R (Some attributes may be neither input, nor target). For example, the Step 4.2 - Repeat for the maximum number of attributes used as primary keys in the hidden layers (default = number of candidate input operational database (like a social security attributes). number) may be completely useless for a data Step 4.2.1 - Repeat for every candidate input warehouse, which has primary keys of its own. attribute i’ which is still not an input attribute: Sometimes, (e.g., in health care) the identifying attributes are removed from the data warehouse Step 4.2.1.1 - Initialize to zero the degrees of to preserve the privacy of the historic records. freedom, the estimated conditional mutual information, and the likelihood-ratio statistic of the 2.2 Feature Selection Procedure candidate input attribute and the target attribute, given the final hidden layer of nodes. Conditional The main steps of the feature selection procedure mutual information (Cover, 1991) measures the net are given below. decrease in the entropy (uncertainty) of the Step1 - Obtain the relation schema (name, type, and dependent attribute due to adding information about domain size of each attribute) and the schema each new classifying attribute. If the target is partition into a subset of candidate input and a completely independent of an attribute, the subset of target attribute (see the extended relational conditional mutual information, given that attribute model in sub-section 2.1 above). is zero. Step 2 - Read the relation tuples (records) from the Step 4.2.1.2 - Repeat for every node of the final operational database. Tuples with illegal or missing hidden layer: target values are ignored by the algorithm. Step 4.2.1.2.1 - Calculate the estimated conditional Step 2.1- Encode missing values of candidate input mutual information of the candidate input attribute attributes in a pre-determined form. and the target attribute, given the node. Step 2.2- Discretize each continuous attribute by Step 4.2.1.2.2 - Calculate the likelihood-ratio maximizing its mutual information with the target statistic of the candidate input attribute and the attribute. Mutual information (see Cover, 1991) is target attribute, given the node. defined as a decrease in the uncertainty of one Step 4.2.1.2.3 - If the likelihood-ratio statistic is attribute, given the value of another attribute. significant, mark the node as “splitted” and increment the conditional mutual information of the M. Last, O. Maimon 7-3 candidate input attribute and the target attribute, Details of the network construction procedure for a given the final hidden layer of nodes; else mark the single target attribute are provided in (Maimon et node as “unsplitted”. al., 1999). 
The application of the algorithm to design of data warehouses is presented here for the Step 4.2.1.2.4 - Go to next node. first time. Step 4.2.1.3 - Go to next candidate input attribute. 1,1 Step 4.2.2 - Find the candidate input attribute 1 1 1,2 maximizing the estimated conditional mutual 2 information. According to the information theory, 0 3,1 2 this attribute is the best predictor of the target, given 3 Layer No. 0 the values of the other input attributes. (the root node) 3 3,2 Step 4.2.3 - If the maximum estimated conditional Layer No. 1 Connection Target (First input Layer No. 2 Weights mutual information is greater than zero: Layer (Second input attribute) • Make the best candidate attribute an input attribute) attribute. • Define a new layer of hidden nodes for a Figure 2: Information-Theoretic Connectionist Cartesian product of splitted hidden nodes of Network - Two-Layered Structure the previous layer and values of the best candidate attribute. • Record the tuples associated with every node of 2.3 Extracting Functional Dependencies the new layer (a new node is defined if there is at least one tuple associated with it). Each connection between a terminal (unsplitted / final layer) node and a target node in the Else stop the search and output the subset Ii of input information-theoretic network represents an attributes (classifying features) associated with the association rule between a conjunction of input target attribute i. attribute-values and a target value. Due to the Step 4.3 - Go to next target attribute. inherent noisiness of the real-world data, this rule can be considered a weak (probabilistic) functional Step 5 - Define the set I of selected attributes dependency as opposed to deterministic functional (dimensions) as the union of sets of input attributes dependencies of primary and alternate keys with respect to every target attribute by: (Schouten, 1999). In the relational model (see Korth and Silberschatz, 1991), a functional |O| dependency between sets of attributes X and Y I =  Ii means that the values of the Y component of a tuple i =1 are determined by the values of the X component. On the contrast, an association rule between X and Y has a more limited meaning: given values of X, we Step 6 - End. can estimate the probability distribution of Y. In Figure 2, a structure of a two-layered network An information-theoretic weight of each rule is (based on two selected input attributes) is shown. given by: The first input attribute has three values, represented by nodes no. 1,2, and 3 in the first layer, but only nodes no. 1 and 3 are splitted due to the statistical P(Vij / z ) significance testing in Step 4.2.1.2 above. The w zij= P (Vij ; z ) • log P(Vij ) second layer has four nodes standing for the combinations of two values of the second input attribute with two splitted nodes of the first layer. Like in Figure 2, the target attribute has three Where values, represented by three nodes in the target P(Vij;z) - an estimated joint probability of the target layer. value Vij and the node z. P (Vij/ z) - an estimated conditional (a posteriori) probability of the target value Vij, given the node z. P(Vij) - an estimated unconditional (a priori) probability of the target value Vij. M. Last, O. Maimon 7-4 The connection weights express the mutual of tuples (n). In most cases the number of nodes information between hidden and target nodes. 
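To make Steps 4.2-4.2.3 concrete, here is a minimal Python sketch of the forward search, assuming the tuples have already been discretized and are represented as dictionaries of attribute values. The function names, the node representation, and the use of scipy's chi-square distribution for the likelihood-ratio test are our assumptions; this is an illustration of the idea, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

from scipy.stats import chi2   # likelihood-ratio test against a chi-square law


def conditional_mi_and_g(records, node_ids, cand, target):
    """Per-node estimates of MI(cand; target | node), the likelihood-ratio (G)
    statistic and its degrees of freedom.  `records` is a list of dicts of
    discretized values; `node_ids[k]` is the hidden node of record k."""
    per_node = defaultdict(list)
    for k, rec in enumerate(records):
        per_node[node_ids[k]].append((rec[cand], rec[target]))
    n_total, stats = len(records), {}
    for node, pairs in per_node.items():
        n = len(pairs)
        joint = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        mi = sum((nxy / n_total) * math.log2(nxy * n / (px[x] * py[y]))
                 for (x, y), nxy in joint.items())
        g = 2.0 * math.log(2) * n_total * mi        # G statistic for this node
        dof = (len(px) - 1) * (len(py) - 1)
        stats[node] = (mi, g, dof)
    return stats


def select_features(records, candidates, target, alpha=0.001):
    """Stepwise forward selection sketch (Steps 4.2-4.2.3), illustrative only."""
    node_ids = [()] * len(records)       # every tuple starts at the root node
    final_layer = {()}                   # nodes of the most recent hidden layer
    selected, remaining = [], set(candidates)
    while remaining and final_layer:
        best = None                      # (attribute, conditional MI, nodes to split)
        for cand in remaining:
            stats = conditional_mi_and_g(records, node_ids, cand, target)
            split, cmi = set(), 0.0
            for node, (mi, g, dof) in stats.items():
                # Step 4.2.1.2.3: count a node only if the split is significant.
                if node in final_layer and dof > 0 and chi2.sf(g, dof) < alpha:
                    split.add(node)
                    cmi += mi
            if cmi > 0 and (best is None or cmi > best[1]):
                best = (cand, cmi, split)
        if best is None:                 # Step 4.2.3: no significant decrease left
            break
        attr, _, split = best
        selected.append(attr)
        remaining.discard(attr)
        # New layer: Cartesian product of the split nodes with the values of attr.
        node_ids = [nid + ((attr, rec[attr]),) if nid in split else nid
                    for nid, rec in zip(node_ids, records)]
        final_layer = {nid for nid in node_ids if nid and nid[-1][0] == attr}
    return selected
```

For the case study of Section 3, a loop of this kind would be run once per KPI (TARGET_B and TARGET_D), with the 0.1% default significance level of Step 3.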
2.3 Extracting Functional Dependencies

Each connection between a terminal (unsplitted or final layer) node and a target node in the information-theoretic network represents an association rule between a conjunction of input attribute values and a target value. Due to the inherent noisiness of real-world data, this rule can be considered a weak (probabilistic) functional dependency, as opposed to the deterministic functional dependencies of primary and alternate keys (Schouten, 1999). In the relational model (see Korth and Silberschatz, 1991), a functional dependency between sets of attributes X and Y means that the values of the Y component of a tuple are determined by the values of the X component. In contrast, an association rule between X and Y has a more limited meaning: given values of X, we can estimate the probability distribution of Y.

An information-theoretic weight of each rule is given by:

$$ w^{z}_{ij} = P(V_{ij}; z) \cdot \log \frac{P(V_{ij}/z)}{P(V_{ij})} $$

where

P(Vij; z) - the estimated joint probability of the target value Vij and the node z.

P(Vij/z) - the estimated conditional (a posteriori) probability of the target value Vij, given the node z.

P(Vij) - the estimated unconditional (a priori) probability of the target value Vij.

The connection weights express the mutual information between the hidden and the target nodes. A connection weight is positive if the conditional probability of a target attribute value, given the node, is higher than its unconditional probability, and negative otherwise. A weight close to zero means that the target attribute value is almost independent of the node value. Each rule having a positive connection weight can thus be interpreted as "if node, then target value". Accordingly, a negative weight refers to a rule of the form "if node, then not target value".

The most informative rules can be found by ranking the information-theoretic connection weights $w^{z}_{ij}$ in decreasing order. Both the rules having the highest positive weights and those having the lowest negative weights are of potential interest to a user. The sum of the connection weights at all unsplitted and final layer nodes is equal to the estimated mutual information between a set of input attributes and a target attribute (see the definition of mutual information in Cover, 1991). According to the well-known Pareto principle, a small number of informative rules is expected to explain a major part of the total mutual information, which agrees with the experimental results presented in the next section.
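As an illustration only, the rule weights defined above can be estimated directly from tuple counts; the sketch below assumes the same record and node representation as the earlier sketch and uses base-2 logarithms, consistent with entropies measured in bits.

```python
import math
from collections import Counter


def rule_weights(records, node_ids, target):
    """Estimate w^z_ij = P(V_ij; z) * log2(P(V_ij | z) / P(V_ij)) for every
    (terminal node z, target value V_ij) pair.  Illustrative names only."""
    n = len(records)
    prior = Counter(rec[target] for rec in records)                 # counts of V_ij
    joint = Counter((node_ids[k], records[k][target]) for k in range(n))
    node_size = Counter(node_ids)
    weights = {}
    for (z, v), n_zv in joint.items():
        p_joint = n_zv / n                                          # P(V_ij; z)
        p_cond = n_zv / node_size[z]                                # P(V_ij | z)
        p_prior = prior[v] / n                                      # P(V_ij)
        weights[(z, v)] = p_joint * math.log2(p_cond / p_prior)
    # Ranking by weight gives the kind of rule list reported in Table 2.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```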
2.4 Computational Complexity

To calculate the computational complexity of the feature selection procedure, we use the following notation:

n - total number of tuples in the training data set
|C| - total number of candidate input attributes
p - portion of significant input attributes selected by the search procedure
m - number of hidden layers (input attributes), m ≤ |C|
|O| - total number of target attributes
MC - maximum domain size of a candidate input attribute
MT - maximum domain size of a target attribute

The computational "bottleneck" of the algorithm is calculating the estimated conditional mutual information MI(Ai'; Ai / z) of the candidate input attribute Ai' and the target attribute Ai, given a hidden node z. Since each node of the m-th hidden layer represents a conjunction of values of m input attributes, the total number of nodes at layer No. m is apparently bounded by (MC)^m. However, we restrict the definition of a new node (see Step 4.2.3 above) by the requirement that there is at least one tuple associated with it. Thus, the total number of nodes at any hidden layer cannot exceed the total number of tuples (n). In most cases the number of nodes will be much smaller than n, due to tuples having identical values and the statistical significance requirement of the likelihood-ratio test.

The calculation of MI(Ai'; Ai / z) is performed at each hidden layer of every target attribute for all candidate input attributes at that layer. The summation members of MI(Ai'; Ai / z) refer to the Cartesian product of the values of a candidate input attribute and a target attribute. The number of hidden layers is equal to p|C|. This implies that the total number of calculations is bounded by:

$$ |O| \cdot \sum_{m=0}^{p|C|} n \cdot M_T \cdot M_C \cdot (|C| - m) \;\le\; |O| \cdot n \cdot M_T \cdot M_C \cdot \frac{|C|^2 \cdot p \cdot (2 - p)}{2} $$

Thus, the feature selection algorithm can be applied to large-scale databases in a time which is linear in the number of records and quadratic in the number of candidate input attributes. It is also directly proportional to the number of target attributes (key performance indicators).
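The closed form on the right-hand side of the bound above follows from summing the arithmetic series over the hidden layers; as a brief check (treating p|C| as an integer and dropping the negligible "+1" terms):

$$ \sum_{m=0}^{p|C|} (|C| - m) \;=\; (p|C| + 1)\,|C| - \frac{p|C|\,(p|C| + 1)}{2} \;\approx\; p\,|C|^2 - \frac{p^2 |C|^2}{2} \;=\; \frac{|C|^2\, p\,(2 - p)}{2}, $$

and multiplying by the per-term factor |O| · n · M_T · M_C gives the stated bound.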
3 Case Study: Direct Marketing Database

3.1 Background and Objectives

The original source of data is the Paralyzed Veterans of America (PVA), a non-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an in-house database of over 13 million donors, PVA is also one of the largest direct mail fundraisers in the US. The data set presents the results of one of PVA's recent fund raising appeals. This mailing was sent to 3.5 million PVA donors who were on the PVA database as of June 1997. Everyone included in this mailing had donated at least once to PVA.

One group that is of particular interest to PVA is "lapsed" donors. These individuals donated to PVA 13 to 24 months ago. They represent an important group to PVA, since the longer someone goes without donating, the less likely they will be to give again. Therefore, recapture of these former donors is a critical aspect of PVA's fund raising efforts. The data set to be analyzed includes all lapsed donors who received the mailing (responders and non-responders). The total dollar amount of the gift is given for each responder. The attributes extracted from the database can be used for understanding the behavior of both the most and the least profitable individuals. This important insight may lead to more successful promotion campaigns in the future.

The dataset of lapsed donors, extracted from the PVA database, is publicly available on the UCI KDD Archive [http://kdd.ics.uci.edu] as the KDD Cup 1998 Data. It was originally used for the Second International Knowledge Discovery and Data Mining Tools Competition in 1998.

3.2 Database Characteristics

The special characteristics of the Direct Marketing database include the following:

• Dimensionality. The data set to be used contains 95,412 tuples, including 5.1% (4,843 cases) of responders to the mailing. Each tuple has 481 attributes, namely one key, 478 input, and 2 target attributes.

• Input Attributes. There are 76 character (nominal) and 402 numeric (continuous) attributes. These attributes include donor demographic data (as collected by PVA and third-party data sources), the promotion / donation history, and the characteristics of the donor's neighborhood, as collected from the 1990 US Census. For privacy reasons, no identifying data (like the donor name, address, etc.) has been included in the data set.

• Target Attributes. There are two target attributes in each tuple: the binary indicator for response (the attribute TARGET_B) and the dollar amount of the donation (the attribute TARGET_D). Naturally, the donation amount is zero for all non-responders. Both can be considered key performance indicators (KPI's) for the fund raising organization.

• Data Quality. As indicated in the database documentation, some of the fields in the analysis file may contain data entry and/or formatting errors.

• Missing Values. Most input attributes contain a certain amount of missing values, which should be inferred from known values at the pre-processing stage.

3.3 Data Pre-processing

The pre-processing tasks included the following (a minimal code sketch follows this list):

• Attribute decoding. The original attributes presenting donor promotion history and status contain codes, where each byte has a different meaning (e.g., recency, frequency, and amount of donation). These codes have been decoded by splitting each encoded attribute into several separate attributes. The decoding operation has increased the total number of attributes from 481 to 518.

• Missing values. Missing values have been replaced with the mode (the most frequent value of an attribute). As recommended by the data documentation, attributes containing 99.5% or more missing values have been omitted from the analysis.

• Rare values. All values of nominal attributes that occur less than 100 times in the data set have been encoded as "Other".

• Transformation by division. To decrease the number of distinct values for large-scale continuous attributes, the values of some attributes have been divided by a constant factor (10, 100, etc.) and then rounded off to the nearest integer number.

• Discretization of the Target Attribute. The continuous target attribute TARGET_D has been discretized into equal-width intervals of $10 donation each. The total number of discretization intervals has been 20, covering the attribute range between $0 and $200.

• Discretization of Input Attributes. Numeric attributes have been discretized by using the significance level of 99%. For 191 attributes (out of 404), no statistically significant partition has been found and these attributes have been omitted from the further stages of the analysis. The remaining 213 attributes have been left as candidate input attributes.
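A minimal sketch of the simpler pre-processing steps, assuming the raw table is available as a CSV file and using pandas; the file name, the INCOME column used to illustrate the division transform, and the hard-coded thresholds are assumptions for illustration, not the authors' tooling.

```python
import pandas as pd

# Hedged sketch of the pre-processing described above; "cup98.csv" and the
# column names are illustrative assumptions.
df = pd.read_csv("cup98.csv", low_memory=False)

# Missing values: drop nearly empty attributes (99.5% threshold), then
# replace the remaining missing values with the attribute mode.
df = df.loc[:, df.isna().mean() < 0.995]
for col in df.columns:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode()[0])

# Rare values: nominal values occurring fewer than 100 times become "Other".
for col in df.select_dtypes(include="object"):
    counts = df[col].value_counts()
    df[col] = df[col].where(df[col].map(counts) >= 100, "Other")

# Transformation by division: shrink a large-scale continuous attribute
# (hypothetical column) by a constant factor, then round to an integer.
df["INCOME"] = (df["INCOME"] / 100).round()

# Discretization of the target: twenty $10-wide intervals over $0 - $200.
df["TARGET_D_BIN"] = pd.cut(df["TARGET_D"], bins=list(range(0, 201, 10)),
                            right=False)
```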
3.4 Feature Selection

The process of feature selection in the Direct Marketing database requires building separate networks for two KPI's: TARGET_B (the binary indicator for response) and TARGET_D (the dollar amount of the donation). However, when we applied the dimensionality reduction procedure of sub-section 2.2 above to the first target attribute (TARGET_B), no significant input attributes were found. In other words, no candidate input attribute present in the database is relevant to the fact of somebody responding to the mailing. Thus, we proceed with the results of the information-theoretic network built for the second target attribute (TARGET_D) only.

The dimensionality reduction procedure of sub-section 2.2 above has been applied to all tuples of responders to the raising appeal (4,843). The algorithm has selected five significant input attributes, which are about 1% of the original candidate input attributes (478). Thus, the resulting information-theoretic network includes five hidden layers only. The selected attributes and their information-theoretic scores are presented in Table 1 below.

Table 1 shows three information-theoretic measures of association between the input attributes and the target attribute: Mutual Information, Conditional Mutual Information, and Conditional Entropy. All these parameters are based on the notion of entropy (see Cover, 1991), which represents the uncertainty of a random variable. The entropy is measured in bits. Information on input attributes associated with the target can decrease the uncertainty, and hence the resulting entropy, of the target.

The column "Mutual Information" shows the cumulative association between the subset of input attributes selected up to a given iteration, inclusively, and the target attribute. The next column, "Conditional MI (Mutual Information)", shows the net decrease in the entropy of the target attribute TARGET_D due to adding each input attribute. The last column ("Conditional Entropy") is equal to the difference between the unconditional entropy of TARGET_D (1.916 bits) and the estimated mutual information.

Table 1: Direct marketing database - dimensionality reduction procedure

Iteration | Attribute Number | Attribute Name | Attribute Description | Mutual Information | Conditional MI | Conditional Entropy
1 | 318 | LASTGIFT | Dollar amount of most recent gift | 0.6976 | 0.6976 | 1.2188
2 | 316 | MAXRAMNT | Dollar amount of largest gift to date | 0.7599 | 0.0624 | 1.1564
3 | 314 | MINRAMNT | Dollar amount of smallest gift to date | 0.7671 | 0.0072 | 1.1492
4 | 319 | LASTDATE | Date associated with the most recent gift | 0.7711 | 0.0040 | 1.1452
5 | 233 | TPE13 | Percent Traveling 15 - 59 Minutes to Work | 0.7737 | 0.0025 | 1.1427

As you can see from the description column, the first four attributes selected represent the person's donation history, while the fifth significant input attribute characterizes the donor's neighborhood. The most significant attribute, LASTGIFT, alone contributes about 90% of the total mutual information. From the viewpoint of interestingness, the last attribute (TPE13) seems to be the most unexpected one. It is rather unlikely that an average human expert in donations and direct mailing would pick this attribute out of a list of more than 500 attributes.

If the PVA organization is designing a data warehouse for supporting its fundraising activities, the results of the feature selection algorithm can cause a dramatic reduction in the amount of stored data. Given that an average attribute (field) requires four bytes of memory, one donor tuple (record) of 478 input attributes takes about 2K bytes. Since PVA has a donor database of about 13 million records, a dimensionality reduction of 99% means saving about 25 GB of computer storage at the time of data creation (1997). Perhaps PVA could be interested in storing additional information in its data warehouse, based on its business expertise, but still the above results suggest that the majority of the operational data is not needed for loading into the data warehouse: that data is either redundant or completely irrelevant to PVA's fundraising performance.
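The storage estimate above is easy to reproduce; a quick arithmetic check (4 bytes per field, 13 million donor records, decimal gigabytes):

```python
bytes_per_record = 478 * 4                     # ~1.9 KB per donor tuple
total_gb = 13_000_000 * bytes_per_record / 1e9
print(round(total_gb, 1))                      # ~24.9, i.e. roughly 25 GB
# Keeping only ~1% of the candidate attributes saves almost all of it.
```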
3.5 Functional Dependencies

The connections between input and target nodes in the information-theoretic network, constructed in the previous sub-section, have been used to extract disjunctive association rules between the significant input attributes and the target attribute (TARGET_D). A total of 935 positive and 85 negative rules have been extracted. The rules have been scored by the information-theoretic weights of their connections (see sub-section 2.3 above). In Table 2 below, we present only the rules having the highest connection weights in the network.

Most rules in this table indicate that there is a direct relationship between the donation history and the actual amount of donation (TARGET_D). Thus, rule no. 2 says that if the last gift is between $20.5 and $28, then the actual donation is expected to be between $20 and $29 (almost the same range). A similar interpretation can be given to rule no. 3: those who were poor donors the last time are expected to preserve their behavior. Rule no. 1 is a slight but interesting exception: it says that those who donated about $20 the last time are expected to increase their donation by as much as 50%.

Table 2: Direct marketing database - highest positive connection weights

Rule | LASTGIFT | MAXRAMNT | MINRAMNT | TARGET_D | Weight | Cum. Weight | Cum. Percent
1 | $19 - $20.25 | | | $20 - $29 | 0.1164 | 0.1164 | 11.9%
2 | $20.5 - $28 | | | $20 - $29 | 0.0859 | 0.2023 | 20.7%
3 | $1 - $5 | | | $0 - $9 | 0.0841 | 0.2864 | 29.3%
4 | $12 - $14 | | | $10 - $19 | 0.0596 | 0.3460 | 35.4%
5 | $1 - $5 | $7 - $9 | | $0 - $9 | 0.0534 | 0.3994 | 40.9%
6 | $1 - $5 | $10 - $11 | | $0 - $9 | 0.0402 | 0.4396 | 45.0%
7 | $6 - $7 | $7 - $9 | | $0 - $9 | 0.0392 | 0.4788 | 49.0%
8 | $10 - $11 | $10 - $11 | | $10 - $19 | 0.0377 | 0.5165 | 52.9%
9 | $20.5 - $28 | | | $30 - $39 | 0.0307 | 0.5472 | 56.0%
10 | $15 - $15.5 | $15 - $15.5 | $4.68 - $8.99 | $10 - $19 | 0.0293 | 0.5765 | 59.0%

4 Conclusions

The paper presents a method for the automated selection of attributes in a data warehouse. The method is based on the information-theoretic approach to knowledge discovery in databases (Maimon et al., 1999). It does not require any prior knowledge about the business domain, can eliminate both irrelevant and redundant features, and is applicable to any type of data (nominal, numeric, etc.). While no automated procedure can fully replace a human expert, the decisions of data warehouse designers can certainly be supported by a feature selection system such as the one presented here. As shown by the direct marketing example, integrating automated feature selection in the design process can lead to a significant dimensionality reduction of the data warehouse.

Developing automated approaches to the design of data warehouses requires further investigation of several issues. For example, automated methods need to be developed for determining the set of transformations to be applied to the original attributes. A cost associated with each attribute (e.g., the cost of third-party data) can be integrated in the feature selection procedure. The data model (see sub-section 2.1 above) can be further extended to include candidate input attributes from multiple relations (via foreign keys). Another important problem is detecting dynamic changes in the weak functional dependencies and updating the data warehouse structure accordingly.

References

H. Almuallim and T. G. Dietterich, Efficient Algorithms for Identifying Relevant Features, Proc. of the 9th Canadian Conf. on AI, pages 38-45, Morgan Kaufmann, 1992.

S. D. Bay, The UCI KDD Archive [http://kdd.ics.uci.edu], Irvine, CA: University of California, Department of Information and Computer Science, 1999.

R. Caruana and D. Freitag, Greedy Attribute Selection, Proc. of the 11th Conf. on Machine Learning, pages 28-36, Morgan Kaufmann, 1994.

T. M. Cover, Elements of Information Theory, Wiley, New York, 1991.

J. F. Elder IV and D. Pregibon, A Statistical Perspective on Knowledge Discovery in Databases, in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, Menlo Park, CA, pp. 83-113, 1996.

V. R. Gupta, An Introduction to Data Warehousing, System Services Corporation, Chicago, Illinois, 1997.

W. H. Inmon and R. D. Hackathorn, Using the Data Warehouse, John Wiley & Sons, New York, 1994.

G. H. John, R. Kohavi, and K. Pfleger, Irrelevant Features and the Subset Selection Problem, Proc. of the 11th Int'l Conf. on Machine Learning, pages 121-129, Morgan Kaufmann, 1994.

K. Kira and L. A. Rendell, The Feature Selection Problem: Traditional Methods and a New Algorithm, Proc. of AAAI'92, pages 129-134, 1992.

H. F. Korth and A. Silberschatz, Database System Concepts, McGraw-Hill, Inc., New York, 1991.

H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer, Boston, 1998.

O. Maimon, A. Kandel, and M. Last, Information-Theoretic Fuzzy Approach to Knowledge Discovery in Databases, in Advances in Soft Computing - Engineering Design and Manufacturing, R. Roy, T. Furuhashi and P. K. Chawdhry, Eds., Springer-Verlag, London, pp. 315-326, 1999.

H. Schouten, Analysis and Design of Data Warehouses, Proc. of the International Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, 1999.