Confusion Matrix-based Feature Selection

Sofia Visa, Computer Science Department, College of Wooster, Wooster, OH (svisa@wooster.edu)
Brian Ramsay, Sentry Data Systems, Inc., Fort Lauderdale, FL (brian.ramsay@gmail.com)
Anca Ralescu, Computer Science Department, University of Cincinnati, Cincinnati, OH (anca.ralescu@uc.edu)
Esther van der Knaap, Ohio Agricultural Research and Development Center, The Ohio State University, Wooster, OH (vanderknaap.1@osu.edu)

Abstract

This paper introduces a new technique for feature selection and illustrates it on a real data set. The proposed approach creates subsets of attributes based on two criteria: (1) individual attributes have high discrimination (classification) power; and (2) the attributes in the subset are complementary, that is, they misclassify different classes. The method uses information from a confusion matrix and evaluates one attribute at a time.

Keywords: classification, attribute selection, confusion matrix, k-nearest neighbors

Background

In classification problems, good accuracy is the primary concern; however, the identification of the attributes (or features) having the largest separation power is also of interest. Moreover, for very large data sets (such as MRI images of the brain), classification is highly dependent on feature selection. This is mainly because the larger the number of attributes, the sparser the data become, and thus many more (exponentially more) training examples are necessary to accurately sample such a large domain. In this sense, high-dimensional data sets are almost always under-represented. This problem is known in the literature as "the curse of dimensionality". For example, a 2-attribute data set having 10 examples in the square defined by the corners (0,0) and (1,1) covers the domain acceptably. If the domain to be learned is the cube defined by the corners (0,0,0) and (1,1,1), 10 points will not cover this 3-D domain as effectively.

Reducing the number of attributes for a classification problem is a much researched field. The brute force approach to finding the best combination of attributes for classification requires the trial of all possible combinations of the available n attributes: consider one attribute at a time, then investigate all combinations of two attributes, three attributes, and so on. However, this approach is unfeasible because there are 2^n - 1 such possible combinations for n attributes; even for n = 10 there are 1,023 different attribute combinations to be investigated. Additionally, feature selection is especially needed for data sets having large numbers of attributes (e.g. thousands). Examples of such data domains include text categorization and gene expression analysis. In the first example, each document is described by the most frequent words, leading to 20,000 words or more. In working with expressed genes in order to separate healthy from cancer patients, for example, the number of attributes may grow as high as 30,000 (Guyon and Elisseeff 2003). Another example of a challenging domain is the microarray data found in (Xin, Jordan, and Karp 2001), where a hybrid of filter and wrapper approaches is employed to successfully select relevant features to classify 72 data examples in a 7,130-dimensional space.

In addition to reducing the data dimensionality, selecting fewer attributes may improve classification and may give a better understanding of the underlying process that generated the data. Here we propose an attribute-selection technique based on a confusion matrix, with the two-fold objective of better classification and better data understanding.

Depending on where the feature selection module is placed in relation to the classification module, there are two classes of feature selection methods (Jain and Zongker 1997):

• Filter methods (Pudil, Novovicova, and Kittler 1994) rank features (or feature subsets) independently of the predictor. These methods identify irrelevant features to be eliminated by looking at correlation or at the underlying distribution. For example, if two attributes have the same probability distribution, then they are redundant and one of them can be dropped. Such analysis is performed regardless of the classification method. Another filtering method ranks attributes based on the notion of nearest hit (closest example of the same class) and nearest miss (closest example of a different class) (Kira and Rendell 1992). The ranking of the i-th feature is given by the score computed as the average, over all examples, of the difference between the distance to the nearest hit and the distance to the nearest miss, in the projection onto the i-th dimension (Guyon and Elisseeff 2003); a small code sketch of this score is given after this list.

• Wrapper methods (Kohavi and John 1997) use a classifier to assess features (or feature subsets). For example, the decision tree algorithm selects the attributes having the best discriminatory power and places them closer to the root. Hence, besides the classification tree, a ranking of attributes results.
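As a concrete illustration of the nearest hit / nearest miss ranking mentioned in the filter-method item above, the following is a minimal sketch, not taken from the original paper: it assumes X and y are numpy arrays, that at least two classes are present, and that the absolute difference is used as the distance in the single-feature projection. The function name relief_scores and the (miss - hit) orientation, under which larger scores suggest more relevant features, are our own choices.

    import numpy as np

    def relief_scores(X, y):
        # X: (n_examples, n_features) array; y: array of class labels.
        # For each feature, average over all examples the gap between the
        # distance to the nearest miss and the distance to the nearest hit,
        # measured in the 1-D projection onto that feature.
        n_examples, n_features = X.shape
        scores = np.zeros(n_features)
        for i in range(n_features):
            col = X[:, i]
            diffs = []
            for j in range(n_examples):
                d = np.abs(col - col[j])                     # distances along feature i
                same = (y == y[j])
                same[j] = False                              # exclude the example itself
                hit = d[same].min() if same.any() else 0.0   # nearest example of the same class
                miss = d[y != y[j]].min()                    # nearest example of another class
                diffs.append(miss - hit)
            scores[i] = np.mean(diffs)
        return scores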
Another classification of attribute selection methods considers the search strategy over feature subsets. There are two main greedy search strategies: forward selection and backward elimination. Both techniques yield nested subsets of features. Forward selection starts with one attribute and continues adding one attribute at a time as long as the newly formed subset gives better classification. During backward elimination, unpromising attributes are progressively eliminated (Guyon and Elisseeff 2003). This greedy search technique is often used in system identification. The result, in either case, is not guaranteed to be the optimal attribute subset (Sugeno and Yasukawa 1993).

In this research we investigate the use of the confusion matrix (Kohavi and Provost 1998), which contains information about actual and predicted classifications, for attribute selection. In the context described above, this approach is a wrapper method because it uses a classifier to estimate the classification power of an attribute (or subset of attributes).

The Confusion Matrix and Disagreement Score

A confusion matrix of size n x n associated with a classifier shows the predicted and actual classification, where n is the number of different classes. Table 1 shows a confusion matrix for n = 2, whose entries have the following meanings:

• a is the number of correct negative predictions;
• b is the number of incorrect positive predictions;
• c is the number of incorrect negative predictions;
• d is the number of correct positive predictions.

Table 1: The confusion matrix for a two-class classification problem.

                        PREDICTED NEGATIVE    PREDICTED POSITIVE
    ACTUAL NEGATIVE             a                     b
    ACTUAL POSITIVE             c                     d

The prediction accuracy and classification error can be obtained from this matrix as follows:

    Accuracy = (a + d) / (a + b + c + d)                          (1)

    Error = (b + c) / (a + b + c + d)                             (2)

We define the disagreement score associated with a confusion matrix in equation (3). According to this equation, the disagreement is 1 when one of the quantities b or c is 0 (in this case the classifier misclassifies examples of one class only), and it is 0 when b and c are the same.

    D = 0                        if b = c = 0
    D = |b - c| / max{b, c}      otherwise                        (3)

The attribute selection methodology proposed here selects attributes that not only have good discrimination power on their own, but, more importantly, are complementary to each other. For example, consider two attributes A1 and A2 having similar classification accuracy. Our approach will select them as a good subset of attributes if they have a large disagreement in terms of which examples they misclassify. A large disagreement is indicated by D values close to 1 for both attributes, but with distinct denominators in equation (3).
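To make equations (1)-(3) concrete, below is a minimal Python sketch (our own illustration, not code from the paper) that computes the accuracy, error, and disagreement score D from the four entries of a two-class confusion matrix as laid out in Table 1.

    def confusion_metrics(a, b, c, d):
        # a: correct negatives, b: incorrect positives,
        # c: incorrect negatives, d: correct positives (Table 1).
        total = a + b + c + d
        accuracy = (a + d) / total                  # equation (1)
        error = (b + c) / total                     # equation (2)
        if b == 0 and c == 0:
            disagreement = 0.0                      # equation (3), perfect classifier
        else:
            disagreement = abs(b - c) / max(b, c)   # equation (3)
        return accuracy, error, disagreement

    # Example: a classifier whose only errors are on the positive class (b = 0, c > 0)
    # has disagreement 1; equal error counts on both classes give disagreement 0.
    print(confusion_metrics(a=100, b=0, c=8, d=26))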
Algorithm for Confusion Matrix-based Attribute Selection

The pseudocode outlined below shows the steps to perform confusion matrix-based attribute selection for a 2-class classification problem. The method constructs attribute subsets that: (1) contain attributes with good individual classification power, and (2) contain attributes that are complementary (i.e. they disagree in their misclassifications).

Note that the algorithm may lead to several subsets of attributes to be further investigated; among these, the subset yielding the highest classification accuracy may be selected. Also, the algorithm does not account for the possibility that two individually lower-ranked attributes may combine into a subset with high classification accuracy due to their high complementarity.

Algorithm 1: Pseudocode for the Confusion Matrix-based Attribute Selection Algorithm

    Require: 2-class data with n attributes
    Require: a classification technique
    Require: k, the number of attributes in the selected subset
    Ensure: output the k-attribute subset as a tuple Sk = (A1, A2, ..., Ak)

    Compute classifier Ci based on feature Ai, for i = 1..n
    Obtain Accuracy(Ci) and ConfMatrix(Ci)
    Rank the Ai according to Accuracy(Ci), giving ranking RA
    for i = 1..n do
        Compute the disagreement from ConfMatrix(Ci) as Di = |b - c| / max{b, c}
    end for
    Rank the Ai according to Di, giving ranking RD
    Select the top k attributes (according to RA) having large D (according to RD)
        but whose largest errors fall in different classes, giving Sk = (A1, A2, ..., Ak)
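The following is a hedged Python sketch of these steps using scikit-learn's k-nearest neighbors classifier (the classifier used later in the experiments). The train/test split, the 4-neighbor default, and the simple two-pass selection loop are our own illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    def rank_single_attributes(X, y, n_neighbors=4):
        # Evaluate every attribute on its own: accuracy, disagreement score D,
        # and the class that receives the larger number of misclassifications.
        labels = np.unique(y)                                 # the two class labels
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        results = []
        for i in range(X.shape[1]):
            clf = KNeighborsClassifier(n_neighbors=n_neighbors)
            clf.fit(X_tr[:, [i]], y_tr)
            pred = clf.predict(X_te[:, [i]])
            (a, b), (c, d) = confusion_matrix(y_te, pred, labels=labels)   # as in Table 1
            disagreement = 0.0 if b == c == 0 else abs(b - c) / max(b, c)  # equation (3)
            error_class = labels[0] if b > c else labels[1]   # class with the larger error
            results.append((i, accuracy_score(y_te, pred), disagreement, error_class))
        return results

    def select_subset(results, k):
        # Greedy selection in the spirit of Algorithm 1: walk down the accuracy
        # ranking, first picking attributes whose largest errors fall in classes
        # not yet covered (complementarity), then filling the remaining slots.
        by_accuracy = sorted(results, key=lambda r: r[1], reverse=True)
        chosen, covered = [], set()
        for attr, acc, dis, err_cls in by_accuracy:
            if err_cls not in covered:
                chosen.append(attr)
                covered.add(err_cls)
        for attr, acc, dis, err_cls in by_accuracy:
            if attr not in chosen and len(chosen) < k:
                chosen.append(attr)
        return chosen[:k]

For the two-class tomato experiment described later, this kind of selection would pair attribute 14 (the attribute whose larger error falls on class 1) with accurate attributes that err mostly on class 7.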
The Tomato Fruit Data Set

The data set used in the experimental part of this research consists of 416 examples having 34 attributes and distributed in 8 classes (the class distribution is shown in Table 2). This set was obtained from the Ohio Agricultural Research and Development Center (OARDC) research group led by E. Van Der Knaap (Rodriguez et al. 2010), and the classification task is to correctly label a tomato fruit based on morphological measurements such as width, length, perimeter, circularity (i.e. how well a transversal cut of a tomato fits a circle), angle at the tip of the tomato, etc.

The data set was collected as follows: from the scanned image of a longitudinal section of a tomato fruit, the 34 measurements are extracted by the Tomato Analyzer (TA) software (Rodriguez et al. 2010), developed by the same group. For a complete description of the 34 tomato fruit measurements and the TA software see (Gonzalo et al. 2009).

Table 2: Data distribution across classes. In total, there are 416 examples, each having 34 attributes.

    Class   Class label    No. of examples
    1       Ellipse        110
    2       Flat           115
    3       Heart           29
    4       Long            36
    5       Obvoid          32
    6       Oxheart         12
    7       Rectangular     34
    8       Round           48

In addition to tomato classification, of interest here is to find which attributes have more discriminative power, and further to find a ranking of the attributes.

Data Classification and Attribute Selection

In this section we show the decision tree classification of the tomato data set, then we illustrate our attribute selection algorithm (in combination with a k-nearest neighbor classifier) on two of the 8 classes. These two classes (1 and 7) are identified by both decision trees and k-nearest neighbors as highly overlapping.

Classification with Decision Trees - CART

We used the Classification and Regression Trees (CART) method (Breiman et al. 1984) because it generates rules that can be easily understood and explained. At the same time, classification trees have a built-in mechanism to perform attribute selection (Breiman et al. 1984), so we can compare our set of selected attributes, obtained from the confusion matrix and k-nearest neighbors analysis, with the set of attributes identified by CART. However, we anticipate that these sets will not perfectly coincide, which only means that the two approaches quantify the importance of a given attribute (or subset of attributes) differently and that the two methods learn the data differently.

The pruned decision tree obtained using CART is shown in Figure 1. The train and test errors associated with this tree are 11.54% and 18.27%, respectively. As can be seen from this figure, 10 rules can be extracted. In addition, CART selects the following 8 attributes as best for classification (listed in decreasing order of their importance):

• 7 - Fruit Shape Idx Ext1
• 13 - Circular
• 12 - Ellipsoid
• 11 - Fruit Shape Triangle
• 14 - Rectangular
• 10 - Distal Fruit Blockiness
• 8 - Fruit Shape Idx Ext2
• 1 - Perimeter

Figure 1: Decision tree obtained with CART for all data and all attributes.

We also investigate the k-nearest neighbors classifier, as we will use this classification method in combination with our attribute selection approach. Figure 2 shows the k-nearest neighbors classification error for k = 2,...,15, when all 34 attributes are used and when only the best five attributes identified by the CART classification are used.

Figure 2: K-nearest neighbors classification error for k = 2,...,15. The top (blue) line corresponds to runs that include all 34 attributes and the bottom (red) line shows the error when only the best five attributes are used (these attributes were identified through CART classification).

As shown in Figure 2, the k-nearest neighbors classification technique consistently scores lower error when using only the 8 attributes selected by CART rather than all 34. Thus, a natural question arising here is: is there a better combination of attributes than the one selected by CART? For k = 4 the k-nearest neighbors technique yields the lowest error, which justifies our choice of k = 4 in the next experiments.
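As a rough illustration of the comparison reported in Figure 2, the sketch below contrasts the cross-validated error of a 4-nearest-neighbors classifier on all 34 attributes with its error on the CART-selected subset. The load_tomato_data helper is a hypothetical placeholder (the real measurements come from the Tomato Analyzer software), and the 5-fold cross-validation and the 1-based-to-0-based index mapping are our own assumptions, not the paper's protocol.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical loader: X should be a (416, 34) array of Tomato Analyzer
    # measurements and y the 8 shape-class labels of Table 2. The random data
    # below is only a stand-in so the sketch runs.
    def load_tomato_data():
        rng = np.random.default_rng(0)
        X = rng.normal(size=(416, 34))
        y = rng.integers(1, 9, size=416)
        return X, y

    X, y = load_tomato_data()

    # CART-selected attributes (1-based numbers 7, 13, 12, 11, 14, 10, 8, 1),
    # converted here to 0-based column indices; treat this mapping as an assumption.
    cart_subset = [a - 1 for a in (7, 13, 12, 11, 14, 10, 8, 1)]

    knn = KNeighborsClassifier(n_neighbors=4)
    err_all = 1 - cross_val_score(knn, X, y, cv=5).mean()
    err_subset = 1 - cross_val_score(knn, X[:, cart_subset], y, cv=5).mean()
    print(f"4-NN error, all 34 attributes: {err_all:.3f}")
    print(f"4-NN error, CART subset:       {err_subset:.3f}")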
The Confusion Matrix for the Tomato Data Set

When using all 34 attributes and all 8 classes, the confusion matrix obtained from the k-nearest neighbors classifier with k = 4 is shown in Figure 3, where along the x-axis are listed the true class labels and along the y-axis are the k-nearest neighbors class predictions. Along the first diagonal are the correct classifications, whereas all the other entries show misclassifications. The bottom right cell shows the overall accuracy.

Figure 3: Confusion matrix for all classes and all attributes. Class 7 has 8 examples wrongly predicted as class 1 (see top row).

In this confusion matrix it can be seen that 8 examples of class 7 are wrongly predicted as class 1. Additionally, from the CART classification, classes 1 and 7 are identified as the two classes overlapping the most. Thus, the experiment presented next uses the confusion matrix-based attribute selection to better separate these two classes. Namely, we search for a subset of the 34 attributes such that the attributes are complementary in the sense described above and quantified in equation (3).

Confusion Matrix-based Attribute Selection for Classes 1 and 7

When using the data from classes 1 (Ellipse) and 7 (Rectangular), a data set of size 145 is obtained (110 examples from class 1 and 34 from class 7).

As illustrated in Algorithm 1, for each of the 34 attributes the k-nearest neighbor algorithm (with k = 4) is used for classification, and the corresponding classification accuracy and confusion matrix are obtained for each attribute. Further, the 34 attributes are ranked in the order of their individual performance in distinguishing between classes 1 and 7, leading to the ranking set R = 14, 7, 8, 17, 1, 3, 6, 12, 30, 4, 9, 20, 29, 18, 26, 2, 10, 21, 34, 32, 11, 33, 5, 13, 19, 16, 15, 31, 25, 28, 24, 27, 22, 23.

We first create growing nested subsets of attributes in the order specified by their individual classification abilities. Note that this particular choice of subsets is not part of Algorithm 1 and makes no use of complementarity. We simply introduce it as a comparative model for our selection approach, which, besides the accuracy ranking, incorporates complementarity information as well; a code sketch of this baseline is given below.

The above approach is a (greedy) forward selection method, i.e. the attributes are progressively incorporated into larger and larger subsets. However, here we incorporate the attributes in the order dictated by their individual performance, yielding nested subsets of attributes. For example, we do not evaluate the performance of the subset consisting of the first and third attributes; indeed, it may be that this subset performs better than using all top three attributes.

Figure 6 shows the classification accuracy for subsets of attributes consisting of the top 1, top 2, top 3, etc. attributes from R (the subsets are shown on the x-axis, while the y-axis shows classification accuracy). From Figure 6 it can be seen that the highest accuracy is achieved when the top 3 attributes are used together (i.e. attributes 14, 7, and 8), yielding a 97.2% correct classification.

Figure 6: K-nearest neighbors classification accuracy for k = 4 when using data from classes 1 and 7 only. On the x-axis are listed the nested subsets of attributes having the top 1, 2, 3, ..., 34 attributes. The highest accuracy (97.2%) is obtained for the subset having the top 3 attributes: 14, 7, and 8.
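A minimal sketch of this nested-subset baseline is given below (our own illustration, not the authors' code). X17 and y17 are assumed to be a numpy feature matrix and label vector restricted to classes 1 and 7, the 5-fold cross-validation is an assumption, and the list R is the accuracy ranking quoted above.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Accuracy ranking R from the paper (1-based attribute numbers, best first).
    R = [14, 7, 8, 17, 1, 3, 6, 12, 30, 4, 9, 20, 29, 18, 26, 2, 10,
         21, 34, 32, 11, 33, 5, 13, 19, 16, 15, 31, 25, 28, 24, 27, 22, 23]

    def nested_subset_accuracies(X17, y17, ranking=R, n_neighbors=4):
        # Evaluate 4-NN on the top-1, top-2, ..., top-34 attribute subsets, taken
        # in the order of their individual accuracy (the comparative baseline
        # that ignores complementarity).
        accuracies = []
        for m in range(1, len(ranking) + 1):
            cols = [a - 1 for a in ranking[:m]]            # 0-based column indices
            knn = KNeighborsClassifier(n_neighbors=n_neighbors)
            accuracies.append(cross_val_score(knn, X17[:, cols], y17, cv=5).mean())
        return accuracies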
However, as indicated earlier in this paper, evaluating all possible combinations is not a feasible approach. Thus, we propose to combine the attributes that are complementary, i.e. two (or more) attributes that may individually achieve similar classification accuracy but have the largest disagreement (this information is extracted from the confusion matrix).

The disagreement scores for all 34 attributes when classifying data from classes 1 and 7 are listed in Table 3, column 2. As column 3 of the same table shows, only attribute 14 misclassifies more examples of class 1 than of class 7 (see the bottom row). All other 33 attributes have the largest number of misclassifications attributed to class 7. Thus, for this particular data set, we will consider subsets that combine attribute 14 with one or more other attributes having the largest disagreement.

Table 3: Attribute ranking based on disagreement score. The best classification attributes found by CART are shown in bold (they are also underlined). The attributes marked by (*) are the ones identified by our selection algorithm.

    Attr. number   Disagreement score   Class of largest error
    20             1                    7
    22             1                    7
    24             1                    7
    25             1                    7
    26             1                    7
    27             1                    7
    31             1                    7
    23             0.9655               7
    28             0.9565               7
    15             0.9524               7
    2              0.9375               7
    21             0.9375               7
    29             0.9231               7
    12             0.9167               7
    30             0.9167               7
    1              0.9091               7
    3              0.9091               7
    7              0.9000               7
    17             0.9000               7
    5              0.8889               7
    11             0.8824               7
    32             0.8750               7
    34             0.8667               7
    18             0.8462               7
    13             0.8235               7
    19             0.8235               7
    6              0.8182               7
    33             0.8125               7
    4              0.7273               7
    9*             0.7273               7
    16*            0.6875               7
    8*             0.6250               7
    10*            0.3000               7
    14*            0.2500               1

For example, Figures 4 and 5 show the confusion matrices for classes 1 and 7 when using attributes 14 and 20, respectively. These two attributes disagree the most, as these figures and Table 3 show.

Figure 4: Confusion matrix for classes 1 and 7 along attribute 14. Four examples of class 1 are misclassified as class 7, and 3 examples of class 7 are misclassified as class 1.

Figure 5: Confusion matrix for classes 1 and 7 along attribute 20. Fourteen examples of class 7 are misclassified as class 1.

Algorithm 1 is illustrated here for k = 2, 3, 4, and 5 only (note that for k = 1 the classification results are the same as in Figure 6). The classification accuracy for these experiments is shown in Figures 7, 8, 9, and 10, respectively. In addition to selecting the top k attributes from Table 3 (this combination yields the first star in these plots), we also plot the classification accuracy of a sliding window of k attributes, moving from top to bottom of Table 3.

Figures 7, 8, 9, and 10 show the classification accuracy of the k-nearest neighbor algorithm when attribute 14 is combined with the other attributes in decreasing order of their disagreement scores (see Table 3). Figure 7 shows results for 2-member subsets, and the results in Figure 8 are obtained for 3-member subsets: attribute 14 and two consecutive attributes from Table 3. A code sketch of this sliding-window evaluation is given below.

As Figure 10 illustrates, simply selecting the top attributes from Table 3 (those having the largest disagreement) does not ensure better classification, nor is an increasing or decreasing trend observed when sliding down the table. This is because classification ability is not additive, which opens up the question of whether a better k-attribute subset can be obtained by mixing attributes across the table, not only by taking k neighboring rows of the table as done here.
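Below is a hedged sketch of the sliding-window experiment (again our own illustration): attribute 14 is combined with k - 1 consecutive attributes taken in the disagreement order of Table 3. X17 and y17 denote the class-1/class-7 data as before, and the cross-validation protocol is an assumption rather than the paper's setup.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Disagreement ranking from Table 3 (1-based attribute numbers, attribute 14 excluded).
    DISAGREEMENT_ORDER = [20, 22, 24, 25, 26, 27, 31, 23, 28, 15, 2, 21, 29, 12, 30,
                          1, 3, 7, 17, 5, 11, 32, 34, 18, 13, 19, 6, 33, 4, 9, 16, 8, 10]

    def sliding_window_accuracies(X17, y17, k, n_neighbors=4):
        # For each window of k-1 consecutive attributes in the disagreement ranking,
        # evaluate 4-NN on {attribute 14} plus that window (cf. Figures 7-10).
        accuracies = []
        for start in range(len(DISAGREEMENT_ORDER) - (k - 1) + 1):
            window = DISAGREEMENT_ORDER[start:start + (k - 1)]
            cols = [13] + [a - 1 for a in window]      # attribute 14 is column index 13
            knn = KNeighborsClassifier(n_neighbors=n_neighbors)
            accuracies.append(cross_val_score(knn, X17[:, cols], y17, cv=5).mean())
        return accuracies

    # Example: accuracies of the 5-member subsets (attribute 14 plus four consecutive
    # attributes), the setting in which the paper reports its best result.
    # acc5 = sliding_window_accuracies(X17, y17, k=5)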
Figure 7: K-nearest neighbors classification accuracy for k = 4 when using 2-member subsets of attributes. Each subset contains attribute 14 and one of the remaining attributes; the x-axis shows these subsets listed in the order of their complementarity (see Table 3). The largest accuracy is 97.3%.

Figure 8: K-nearest neighbors classification accuracy for k = 4 when using 3-member subsets of attributes. Each subset contains attribute 14 and two consecutive attributes (in the order of their complementarity, see Table 3). The largest accuracy is 97.3%.

Figure 9: K-nearest neighbors classification accuracy for k = 4 when using 4-member subsets of attributes. Each subset contains attribute 14 and three consecutive attributes (in the order of their complementarity, see Table 3). The largest accuracy is 96.7%.

Figure 10: K-nearest neighbors classification accuracy for k = 4 when using 5-member subsets of attributes. Each subset contains attribute 14 and four consecutive attributes (in the order of their complementarity, see Table 3). The largest accuracy is 98%.

Among the k-member subsets investigated here (note that there are more sliding-window subsets for k > 5), the largest classification accuracy (98%) is achieved for a 5-member subset, namely the attribute set 14, 9, 16, 8, and 10. The CART classifier recognizes these two classes with 93% accuracy (using attributes 7, 13, 12, 14, 11, and 10), and the accuracy-ranking-only selection (no complementarity information incorporated) achieves 97.3% (using the top 3 attributes: 14, 7, and 8). The attribute subset with the largest discriminating power (when using the k-nearest neighbors classifier with k = 4) is thus obtained with the confusion matrix-based attribute selection; however, it is not a large improvement, as the classes are fairly well separated to begin with.

Conclusions and Future Work

A new technique for attribute selection is proposed here. The method selects attributes that are complementary to each other, in the sense that they misclassify different classes, and favors attributes that have good classification abilities by themselves. This new approach is illustrated on a real data set. For two classes of interest within this data set, the technique found a subset of attributes yielding higher classification accuracy than using all attributes or even the 8 attributes identified by CART. However, we must investigate this new approach on more data sets and in combination with other classification techniques (here only the k-nearest neighbor classifier was investigated). Another future direction is to investigate the use of subsets that combine complementary attributes even if these attributes are weak classifiers by themselves. The challenging factor for this approach is the large number of subsets that must be investigated; depending on the data set, if this search space is very large, then genetic algorithms can be used to explore it. We must also extend this method to multi-class data sets and investigate its scalability.
Acknowledgments

Esther van der Knaap acknowledges support from the NSF grant NSF DBI-0922661. Sofia Visa was partially supported by the NSF grant DBI-0922661 (60020128) and by the College of Wooster Faculty Start-up Fund.

References

Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C., eds. 1984. Classification and Regression Trees. CRC Press, Boca Raton, FL.

Gonzalo, M.; Brewer, M.; Anderson, C.; Sullivan, D.; Gray, S.; and van der Knaap, E. 2009. Tomato Fruit Shape Analysis Using Morphometric and Morphology Attributes Implemented in Tomato Analyzer Software Program. Journal of American Society of Horticulture 134:77-87.

Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157-1182.

Jain, A., and Zongker, D. 1997. Feature selection: evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2):153-158.

Kira, K., and Rendell, L. 1992. A practical approach to feature selection. In International Conference on Machine Learning, 368-377.

Kohavi, R., and John, G. 1997. Wrappers for feature subset selection. Artificial Intelligence 97:273-324.

Kohavi, R., and Provost, F. 1998. On Applied Research in Machine Learning. In Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Columbia University, New York, volume 30.

Pudil, P.; Novovicova, J.; and Kittler, J. 1994. Floating search methods in feature selection. Pattern Recognition Letters 15(11):1119-1125.

Rodriguez, S.; Moyseenko, J.; Robbins, M.; Huarachi Morejon, N.; Francis, D.; and van der Knaap, E. 2010. Tomato Analyzer: A Useful Software Application to Collect Accurate and Detailed Morphological and Colorimetric Data from Two-dimensional Objects. Journal of Visualized Experiments 37.

Sugeno, M., and Yasukawa, T. 1993. A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems 1(1):7-31.

Xin, E.; Jordan, M.; and Karp, R. 2001. Feature Selection for High-Dimensional Genomic Microarray Data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), 601-608.