<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimation and Feature Selection by Application of Knowledge Mined from Decision Rules Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wieslaw Paja</string-name>
          <email>wpaja@ur.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Pancerz</string-name>
          <email>kpancerz@wszia.edu.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Mathematics and Natural Sciences, University of Rzeszow</institution>
          ,
          <addr-line>Prof. St. Pigonia Str. 1, 35-310 Rzeszow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Information Technology and Management</institution>
          ,
          <addr-line>Sucharskiego Str. 2, 35-225 Rzeszow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Management and Administration</institution>
          ,
          <addr-line>Akademicka Str. 4, 22-400 Zamosc</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>57</fpage>
      <lpage>68</lpage>
      <abstract>
        <p>Feature selection methods, as a preprocessing step in machine learning, are effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase in the dimensionality of data poses a severe challenge to the efficiency and effectiveness of many existing feature selection methods. In this work, we introduce a novel concept of relevant feature selection based on information gathered from decision rule models. A new measure of feature rank based on feature frequency and rule quality is additionally defined. The efficiency and effectiveness of our method are demonstrated on five real-world datasets. Six different classification algorithms were used to measure the quality of learning models built on the original features and on the selected features.</p>
      </abstract>
      <kwd-group>
        <kwd>Feature selection</kwd>
        <kwd>feature ranking</kwd>
        <kwd>decision rules</kwd>
        <kwd>dimensionality reduction</kwd>
        <kwd>relevance and irrelevance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In the era of acquiring vast amounts of data in databases from many domains, the efficient
analysis of data and retrieval of regularities have become extremely
important tasks. Classification and object recognition are applied in many fields
of human activity. Data mining is hampered by many factors: a very
large number of observations, too many attributes, the insignificance of some
variables for the classification process, mutual interdependence of conditional variables,
the simultaneous presence of variables of different types, the presence of undefined
variable values, the presence of erroneous variable values, and an uneven
distribution of categories of the target variable. Thus, the development of efficient methods for
selecting significant features remains an important goal.</p>
      <p>Feature selection (FS) methods are frequently used as a preprocessing step in
machine learning experiments. An FS method can be defined as a process of choosing a
subset of the original features so that the feature space is optimally reduced according to
a certain evaluation criterion. Feature selection has been a fruitful field of research and
development since the 1970s, and it has been proven to be effective in removing irrelevant
features, increasing efficiency in learning tasks, improving learning performance, e.g.,
predictive accuracy, and enhancing the comprehensibility of the learned results [1].</p>
      <p>
        Feature selection methods are typically divided into three classes based on how
they combine the selection algorithm and the model building: filter, wrapper and
embedded FS methods. Filter methods select features irrespective of the model. They rely
only on general characteristics of the data, such as the correlation of a feature with the
variable to be predicted, and retain only the most interesting variables; the selected subset
then becomes part of the classification model. Such methods are computationally efficient
and robust to overfitting [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ]. However, some redundant but relevant features may
remain unrecognized. In turn, wrapper methods evaluate subsets of features, which makes
it possible to detect interactions between variables [
        <xref ref-type="bibr" rid="ref2">1, 3</xref>
        ]. However, the risk of
overfitting increases when the number of observations is insufficient, and
the computation time grows considerably when the number of variables is large.
The third class, embedded methods, performs feature selection as part of the
learning process itself. Methods in this group try to combine the advantages of the two
approaches mentioned previously: the learning algorithm exploits its own
variable selection mechanism. It therefore needs to know in advance what a good selection
is, which limits its applicability [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ].
      </p>
      <p>Kohavi and John [1] observed that there are several definitions of relevance that
may be contradictory and misleading. They proposed two degrees of relevance (strong
and weak) that are required to encompass all notions usually associated with this term.
In their approach, relevance is defined in absolute terms, with the help of the
ideal Bayes classifier. In this context, a feature X is strongly relevant when the removal
of X alone from the data always results in a deterioration of the prediction accuracy of
the ideal Bayes classifier. In turn, a feature X is weakly relevant if it is not strongly
relevant and there exists a subset of features S such that the performance of the ideal
Bayes classifier on S is worse than the performance on S ∪ {X}. A feature is irrelevant
if it is neither strongly nor weakly relevant.</p>
      <p>
        Nilsson et al. [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] introduced formal definitions of two different feature selection
problems:
1. Minimal Optimal Feature Selection (MOFS), consisting in the identification of a minimal
set of features needed to obtain the optimal quality of classification.
2. All Relevant Feature Selection (ARFS), where the problem is to find all the
variables that may, under certain conditions, improve the classification.
      </p>
      <p>
        There are two important differences between these problems. The first one is
detection of attributes with low importance (ARFS) [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ], which may be completely obscured
by other, more important attributes from the point of view of the classifier (MOFS).
The second difference concerns finding the boundary between variables weakly but
genuinely related to the decision and those whose apparent relation arises from
random fluctuations. The formal definition of all relevant feature selection (ARFS)
as a problem distinct from the classical minimal optimal feature selection
(MOFS) was proposed relatively recently, in 2007 [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ].
      </p>
      <p>
        In our research, we used the contrast variable concept to distinguish between
relevant and irrelevant features [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ]. A contrast variable is, by design, a variable that carries no information about the
decision variable; it is added to the system in order to discern relevant from
irrelevant variables. Here, it is obtained from the real variables by randomly permuting
their values between objects. The use of contrast variables was first proposed
by Stoppiglia et al. [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] and then by Tuv et al. [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ].
      </p>
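      <p>The permutation construction described above can be sketched in a few lines of Python on hypothetical toy data (the numpy usage is an illustrative assumption, not the authors' implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data matrix: 4 objects, 2 original features.
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.8, 2.7]])

# Each contrast ("shadow") variable is a copy of an original column with its
# values randomly permuted across objects: the marginal distribution is kept,
# but any link to the decision variable is destroyed.
contrast = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, contrast])   # original features + their contrasts
print(X_extended.shape)  # (4, 4)
```

Any feature whose importance does not exceed that of these shadow columns can then be treated as indistinguishable from noise.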
    </sec>
    <sec id="sec-2">
      <title>2 Methods and Algorithms</title>
      <p>During the experiments, the following general procedure was applied:
1. Step 1. Selection of a dataset and features for investigation:
(a) application of a set of ranking measures to calculate the importance of each
feature:
i. with a set of contrast features,
ii. without contrast features;
(b) definition (selection) of the most important feature subset.
2. Step 2. Application of different machine learning algorithms for classification of
unseen objects using the 10-fold cross-validation method:
(a) using all original features,
(b) using only the selected, important features.
3. Step 3. Comparison of the gathered results using different evaluation measures.</p>
      <p>In the first step, a dataset as well as the features for investigation were defined. Then,
different ranking measures were applied to estimate the importance of each feature. In order
to check the specificity of the feature selection, the dataset was extended by adding contrast
variables: each original variable was duplicated and its values were
randomly permuted between all objects. Hence, a set of shadow variables that are
non-informative by design was added to the original variables. The variables selected as
significantly more important than the random ones were examined further using different tests. To
define the level of feature importance, six well-known ranking measures were applied:
ReliefF, Information Gain, Gain Ratio, Gini Index, SVM weight, and RandomForest.
Additionally, our new measure, called RQualityFS, was introduced. It is based on the
frequency with which a given feature appears in a rule model generated from the original
dataset, and it also takes into account the quality of the rules in which this feature
occurs. The rank quality of the i-th feature can be presented as follows:</p>
      <p>QA_i = Σ_{j=1..n} QR_j · {A_i}  (1)</p>
      <p>QR_j = E_corr / (E_corr + E_incorr)  (2)</p>
      <p>where n is the number of rules in the model, QR_j defines the classification quality of
the rule R_j, and {A_i} denotes the presence of the i-th attribute: it is equal to 1
if the feature occurs in the rule and to 0 if it does not. In turn, in the definition (2) of
rule quality, E_corr denotes the number of learning objects correctly matched by the j-th rule
and E_incorr the number of learning objects incorrectly matched by this rule.</p>
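      <p>Equations (1) and (2) can be illustrated with a short sketch (the rule model and its match counts below are hypothetical, chosen only to exercise the formulas):</p>

```python
# Hypothetical rule model: (features used in the rule, E_corr, E_incorr),
# i.e. the numbers of learning objects the rule matches correctly/incorrectly.
rules = [
    ({"deg-malig", "node-caps"}, 40, 10),   # QR = 40 / 50 = 0.8
    ({"deg-malig"}, 27, 3),                 # QR = 27 / 30 = 0.9
    ({"irradiat"}, 15, 15),                 # QR = 15 / 30 = 0.5
]

def rule_quality(e_corr, e_incorr):
    # Eq. (2): fraction of matched learning objects classified correctly
    return e_corr / (e_corr + e_incorr)

def rqualityfs(feature, rules):
    # Eq. (1): sum of the qualities of the rules in which the feature occurs
    return sum(rule_quality(ec, ei) for feats, ec, ei in rules if feature in feats)

print(rqualityfs("deg-malig", rules))   # 0.8 + 0.9
print(rqualityfs("irradiat", rules))    # 0.5
```

A feature appearing in many high-quality rules thus accumulates a high rank, while a feature absent from the rule model scores 0.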
      <p>
        During the second step, a test probing the importance of variables was performed
by analyzing the influence of the variables used for model building on the prediction
quality. Six different machine learning algorithms were applied to build predictors
on the original set of features and on the selected features: Classification Tree (CT),
Random Forest (RF), the CN2 decision rules algorithm (CN2), Naive Bayes (NB), k-Nearest
Neighbors (kNN), and Support Vector Machine (SVM). During this step, the 10-fold cross-validation
paradigm was used. Ten well-known evaluation measures were computed for each
predictor: Classification Accuracy (CA), Sensitivity, Specificity, Area Under the ROC curve
(AUC), Information Score (IS), F1 score (F1), Precision, Brier score, the Matthews
Correlation Coefficient (MCC), and finally the Informedness (Inform.) ratio [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ].
      </p>
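      <p>The evaluation step can be sketched as follows (assuming scikit-learn is available; the data, the classifier subset, and the selected columns are synthetic stand-ins, not the study's datasets or results):</p>

```python
# 10-fold cross-validation of three of the listed classifiers, comparing
# accuracy on all features versus on a selected subset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # only features 0 and 1 are relevant
selected = [0, 1]                          # stand-in for the FS result

results = {}
for name, clf in [("CT", DecisionTreeClassifier(random_state=0)),
                  ("NB", GaussianNB()),
                  ("kNN", KNeighborsClassifier())]:
    acc_all = cross_val_score(clf, X, y, cv=10).mean()
    acc_sel = cross_val_score(clf, X[:, selected], y, cv=10).mean()
    results[name] = (acc_all, acc_sel)
    print(name, round(acc_all, 3), round(acc_sel, 3))
```

On data of this kind, dropping the six noise columns typically leaves accuracy unchanged or slightly improved, which is the pattern the comparison in Step 3 looks for.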
    </sec>
    <sec id="sec-3">
      <title>3 Investigated Datasets</title>
      <p>
        Our initial investigations focused on applying the developed algorithm to several
real-world datasets. Five datasets were used during the experiments. Four of them were
gathered from the UCI ML repository, while the fifth set was developed earlier by the
authors [
authors [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ]. A summary of the datasets is presented in Table 1. These datasets have diverse
numbers of objects, features (and feature types), and classes.
To illustrate the proposed methodology, only results for the Breast cancer dataset are
presented in detail. The first step of the experiment revealed six features that were
recommended as important by all or almost all ranking measures. In Table 2, we can
observe that the deg-malig, node-caps, irradiat, inv-nodes, breast, and menopause features
form a stable, core set of features with the highest values of the seven measures
of importance, particularly for the RQualityFS measure introduced in our investigation.
In the same table, a comparison with the importance of contrast variables (italic rows and
the contrast index) is also presented. The most important contrast feature is tumor-size
(contrast), for which the RQualityFS measure, defined earlier, equals 2.34. We treated
this value as a threshold that separates the core, relevant set of attributes from other, less
informative attributes. For most of the measures (except SVM weight) used in this approach,
the selected set of features has values higher than the corresponding
threshold value (underlined values); these values are denoted in bold in
Table 2. Hereby, we can observe that different measures give different thresholds.
      </p>
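      <p>The thresholding idea described above can be sketched in a few lines (only the 2.34 score of tumor-size (contrast) comes from the text; every other value below is an illustrative stand-in, not an actual measurement):</p>

```python
# A real feature is retained only if its rank exceeds the best rank achieved
# by any contrast feature; ranks for the Breast cancer features are invented
# here solely to demonstrate the selection rule.
ranks = {"deg-malig": 6.10, "node-caps": 4.80, "irradiat": 3.90,
         "inv-nodes": 3.40, "breast": 2.90, "menopause": 2.60,
         "tumor-size": 2.10, "age": 1.70}
contrast_ranks = {"tumor-size (contrast)": 2.34, "age (contrast)": 1.12}

threshold = max(contrast_ranks.values())                  # 2.34
selected = [f for f, r in ranks.items() if r > threshold]
print(selected)
# ['deg-malig', 'node-caps', 'irradiat', 'inv-nodes', 'breast', 'menopause']
```

Because each ranking measure yields its own scores, this rule produces a (possibly different) threshold per measure, which matches the observation above that different measures give different thresholds.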
      <p>The second step of the experiment was devoted to evaluating the prediction
quality of the machine learning algorithms described in Section 2. During this step,
six different algorithms were applied using the 10-fold cross-validation method. The
average results for the Breast cancer dataset are shown in Figure 1. This procedure was
applied to two specified feature sets:</p>
      <sec id="sec-3-1">
        <title>Evaluation measures for the Breast cancer dataset using all original features</title>
        <p>CA Sens Spec AUC IS F1 Prec Brier MCC Inform.
0.75 0.59 0.59 0.70 0.08 0.58 0.79 0.37 0.32
0.82 0.79 0.93 0.94 1.32 0.81 0.84 0.27 0.75
0.76 0.70 0.90 0.92 1.09 0.72 0.78 0.34 0.64
0.65 0.56 0.80 0.78 0.69 0.59 0.64 0.50 0.42
0.63 0.54 0.79 0.77 0.52 0.62 0.63 0.51 0.43</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation measures for the Breast cancer dataset using only the selected features</title>
        <p>CA Sens Spec AUC IS F1 Prec Brier MCC Inform.
0.73 0.66 0.66 0.69 0.16 0.66 0.67 0.43 0.33
0.81 0.82 0.93 0.94 1.40 0.82 0.81 0.29 0.75
0.77 0.75 0.91 0.92 1.26 0.75 0.76 0.34 0.67
0.64 0.58 0.80 0.79 0.78 0.61 0.59 0.51 0.39
0.62 0.55 0.79 0.76 0.65 0.58 0.57 0.54 0.36</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          2.
          <string-name><surname>Bermingham</surname>, <given-names>M.L.</given-names></string-name>,
          <string-name><surname>Pong-Wong</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Spiliopoulou</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Hayward</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Rudan</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Campbell</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Wright</surname>, <given-names>A.F.</given-names></string-name>,
          <string-name><surname>Wilson</surname>, <given-names>J.F.</given-names></string-name>,
          <string-name><surname>Agakov</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Navarro</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Haley</surname>, <given-names>C.S.</given-names></string-name>:
          <article-title>Application of high-dimensional feature selection: evaluation for genomic prediction in man</article-title>.
          <source>Sci. Rep.</source>
          <volume>5</volume> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          3.
          <string-name><surname>Phuong</surname>, <given-names>T.M.</given-names></string-name>,
          <string-name><surname>Lin</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Altman</surname>, <given-names>R.B.</given-names></string-name>:
          <article-title>Choosing SNPs using feature selection</article-title>.
          <source>Proceedings - 2005 IEEE Computational Systems Bioinformatics Conference, CSB 2005</source>.
          pp. <fpage>301</fpage>-<lpage>309</lpage> (<year>2005</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dash</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wrapper-filter feature selection algorithm using a memetic framework</article-title>
          .
          <source>IEEE Trans. Syst. Man, Cybern. Part B Cybern</source>
          .
          <volume>37</volume>
          ,
          <fpage>70</fpage>
          -
          <lpage>76</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          5.
          <string-name><surname>Nilsson</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Peña</surname>, <given-names>J.M.</given-names></string-name>,
          <string-name><surname>Björkegren</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Tegnér</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Detecting multivariate differentially expressed genes</article-title>.
          <source>BMC Bioinformatics</source>
          <volume>8</volume>, <issue>150</issue> (<year>2007</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          6.
          <string-name><surname>Rudnicki</surname>, <given-names>W.R.</given-names></string-name>,
          <string-name><surname>Wrzesień</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Paja</surname>, <given-names>W.</given-names></string-name>:
          <article-title>All Relevant Feature Selection Methods and Applications</article-title>.
          In: Stańczyk, U., Jain, L.C. (eds.)
          <source>Feature Selection for Data and Pattern Recognition</source>.
          pp. <fpage>11</fpage>-<lpage>28</lpage>. Springer-Verlag Berlin Heidelberg, Berlin (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          7.
          <string-name>
            <surname>Stoppiglia</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dreyfus</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubois</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oussar</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Ranking a Random Feature for Variable and Feature Selection</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>1399</fpage>
          -
          <lpage>1414</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tuv</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borisov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torkkola</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Feature Selection Using Ensemble Based Ranking Against Artificial Contrasts</article-title>. <source>International Symposium on Neural Networks</source>
          . pp.
          <fpage>2181</fpage>
          -
          <lpage>2186</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fawcett</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An introduction to ROC analysis</article-title>
          .
          <source>Pattern Recognit. Lett</source>
          .
          <volume>27</volume>
          ,
          <fpage>861</fpage>
          -
          <lpage>874</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hippe</surname>
            ,
            <given-names>Z.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bajcar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blajdo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grzymala-Busse</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grzymala-Busse</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knap</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paja</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wrzesien</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Diagnosing Skin Melanoma: Current versus Future Directions</article-title>
          .
          <source>TASK Q</source>
          .
          <volume>7</volume>
          ,
          <fpage>289</fpage>
          -
          <lpage>293</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hernandez-Orallo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A unified view of performance metrics: translating threshold choice into expected classification loss</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>13</volume>
          ,
          <fpage>2813</fpage>
          -
          <lpage>2869</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>