<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimation and Feature Selection by Application of Knowledge Mined from Decision Rules Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wieslaw Paja</string-name>
          <email>wpaja@ur.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Pancerz</string-name>
          <email>kpancerz@wszia.edu.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Mathematics and Natural Sciences, University of Rzeszow</institution>
          ,
          <addr-line>Prof. St. Pigonia Str. 1, 35-310 Rzeszow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Information Technology and Management</institution>
          ,
          <addr-line>Sucharskiego Str. 2, 35-225 Rzeszow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Management and Administration</institution>
          ,
          <addr-line>Akademicka Str. 4, 22-400 Zamosc</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>57</fpage>
      <lpage>68</lpage>
      <abstract>
        <p>Feature selection methods, as a preprocessing step in machine learning, are effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase in the dimensionality of data poses a severe challenge to the efficiency and effectiveness of many existing feature selection methods. In this work, we introduce a novel concept of relevant feature selection based on information gathered from decision rule models. A new measure of feature rank based on feature frequency and rule quality is additionally defined. The efficiency and effectiveness of our method are demonstrated on five real-world datasets. Six different classification algorithms were used to measure the quality of learning models built on the original features and on the selected features.</p>
      </abstract>
      <kwd-group>
        <kwd>Feature selection</kwd>
        <kwd>feature ranking</kwd>
        <kwd>decision rules</kwd>
        <kwd>dimensionality reduction</kwd>
        <kwd>relevance and irrelevance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In the era of acquiring vast amounts of data in databases from many domains, the efficient
analysis of data and retrieval of regularities have become extremely
important tasks. Classification and object recognition are applied in many fields
of human activity. Data mining is hampered by many factors: a very
large number of observations, too many attributes, the insignificance of some
variables for the classification process, mutual interdependence of conditional variables,
the simultaneous presence of variables of different types, the presence of undefined
variable values, the presence of erroneous variable values, and an uneven
distribution of categories of the target variable. Thus, the development of efficient methods for
selecting significant features remains an important goal.</p>
      <p>Feature selection (FS) methods are frequently used as a preprocessing step in
machine learning experiments. An FS method can be defined as a process of choosing a
subset of the original features so that the feature space is optimally reduced according to
a certain evaluation criterion. Feature selection has been a fruitful field of research and
development since the 1970s, and it has been proven to be effective in removing irrelevant
features, increasing efficiency in learning tasks, improving learning performance, e.g.,
predictive accuracy, and enhancing the comprehensibility of the learned results [1].</p>
      <p>
        Feature selection methods are typically divided into three classes based on how
they combine the selection algorithm and the model building: filter, wrapper and
embedded FS methods. Filter methods select features irrespective of the model. They rely
only on general characteristics of the data, such as the correlation of a feature with the
variable to be predicted, and retain only the most interesting variables; the selected subset
then becomes part of the classification model. Such methods are computationally efficient
and robust to overfitting [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ]. However, some redundant but relevant features may
remain unrecognized. In turn, wrapper methods evaluate subsets of features, which makes
it possible to detect interactions between variables [
        <xref ref-type="bibr" rid="ref2">1, 3</xref>
        ]. However, the risk of
overfitting increases when the number of observations is insufficient, and
the computation time grows considerably when the number of variables is large.
The third class, embedded methods, performs feature selection as part of the
learning process itself. Methods in this group try to combine the advantages of the two
approaches mentioned previously: the learning algorithm exploits its own
variable selection mechanism. It therefore needs to know in advance what a good selection
is, which limits its applicability [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ].
      </p>
      <p>Kohavi and John [1] observed that there are several definitions of relevance that
may be contradictory and misleading. They proposed two degrees of relevance (strong
and weak) that are required to encompass all notions usually associated with this term.
In their approach, relevance is defined in absolute terms, with the help of the
ideal Bayes classifier. In this context, a feature X is strongly relevant when the removal
of X alone from the data always results in a deterioration of the prediction accuracy of
the ideal Bayes classifier. In turn, a feature X is weakly relevant if it is not strongly
relevant and there exists a subset of features S such that the performance of the ideal
Bayes classifier on S is worse than the performance on S ∪ {X}. A feature is irrelevant
if it is neither strongly nor weakly relevant.</p>
      <p>
        Nilsson et al. [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] introduced formal definitions of two different feature selection
problems:
1. Minimal Optimal Feature Selection (MOFS), consisting in the identification of a minimal
set of features needed to obtain the optimal quality of classification.
2. All Relevant Feature Selection (ARFS), where the problem is to find all the
variables that may, under certain conditions, improve the classification.
      </p>
      <p>
        There are two important differences between these problems. The first one is
detection of attributes with low importance (ARFS) [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ], which may be completely obscured
by other, more important attributes from the point of view of the classifier (MOFS).
The second difference concerns finding the boundary between variables weakly but
genuinely related to the decision and those whose apparent relation arises from
random fluctuations. The formal definition of all relevant feature selection (ARFS)
as a problem distinct from the classical minimal optimal feature selection
(MOFS) was proposed relatively recently, in 2007 [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ].
      </p>
      <p>
        In our research, we used the contrast variable concept to distinguish between
relevant and irrelevant features [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ]. A contrast variable is, by design, a variable that carries no information about the
decision variable; it is added to the system in order to discern relevant from
irrelevant variables. Here, it is obtained from the real variables by randomly permuting
their values between objects. The use of contrast variables was first proposed
by Stoppiglia et al. [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] and then by Tuv et al. [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ].
      </p>
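      <p>The permutation construction described above can be sketched in a few lines of Python on hypothetical toy data (the numpy usage is an illustrative assumption, not the authors' implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data matrix: 4 objects, 2 original features.
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.8, 2.7]])

# Each contrast ("shadow") variable is a copy of an original column with its
# values randomly permuted across objects: the marginal distribution is kept,
# but any link to the decision variable is destroyed.
contrast = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, contrast])   # original features + their contrasts
print(X_extended.shape)  # (4, 4)
```

Any feature whose importance does not exceed that of these shadow columns can then be treated as indistinguishable from noise.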
    </sec>
    <sec id="sec-2">
      <title>2 Methods and Algorithms</title>
      <p>During the experiments, the following general procedure was applied:
1. Step 1. Selection of a dataset and features for investigation:
(a) application of a set of ranking measures to calculate the importance of each
feature:
i. with a set of contrast features,
ii. without contrast features;
(b) definition (selection) of the most important feature subset.
2. Step 2. Application of different machine learning algorithms for classification of
unseen objects using the 10-fold cross-validation method:
(a) using all original features,
(b) using only the selected, important features.
3. Step 3. Comparison of the gathered results using different evaluation measures.</p>
      <p>In the first step, a dataset as well as the features for investigation were defined. Then,
different ranking measures were applied to estimate the importance of each feature. In order
to check the specificity of the feature selection, the dataset was extended by adding contrast
variables: each original variable was duplicated and its values were
randomly permuted between all objects. Hence, a set of shadow variables that are
non-informative by design was added to the original variables. The variables selected as
significantly more important than the random ones were examined further using different tests. To
define the level of feature importance, six well-known ranking measures were applied:
ReliefF, Information Gain, Gain Ratio, Gini Index, SVM weight, and RandomForest.
Additionally, our new measure, called RQualityFS, was introduced. It is based on the
frequency with which a given feature appears in a rule model generated from the original
dataset, and it also takes into account the quality of the rules in which this feature
occurs. The rank quality of the i-th feature can be presented as follows:</p>
      <p>QA_i = Σ_{j=1..n} QR_j · {A_i}  (1)</p>
      <p>QR_j = E_corr / (E_corr + E_incorr)  (2)</p>
      <p>where n is the number of rules in the model, QR_j defines the classification quality of
the rule R_j, and {A_i} denotes the presence of the i-th attribute: it is equal to 1
if the feature occurs in the rule and to 0 if it does not. In turn, in the definition (2) of
rule quality, E_corr denotes the number of learning objects correctly matched by the j-th rule
and E_incorr the number of learning objects incorrectly matched by this rule.</p>
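      <p>Equations (1) and (2) can be illustrated with a short sketch (the rule model and its match counts below are hypothetical, chosen only to exercise the formulas):</p>

```python
# Hypothetical rule model: (features used in the rule, E_corr, E_incorr),
# i.e. the numbers of learning objects the rule matches correctly/incorrectly.
rules = [
    ({"deg-malig", "node-caps"}, 40, 10),   # QR = 40 / 50 = 0.8
    ({"deg-malig"}, 27, 3),                 # QR = 27 / 30 = 0.9
    ({"irradiat"}, 15, 15),                 # QR = 15 / 30 = 0.5
]

def rule_quality(e_corr, e_incorr):
    # Eq. (2): fraction of matched learning objects classified correctly
    return e_corr / (e_corr + e_incorr)

def rqualityfs(feature, rules):
    # Eq. (1): sum of the qualities of the rules in which the feature occurs
    return sum(rule_quality(ec, ei) for feats, ec, ei in rules if feature in feats)

print(rqualityfs("deg-malig", rules))   # 0.8 + 0.9
print(rqualityfs("irradiat", rules))    # 0.5
```

A feature appearing in many high-quality rules thus accumulates a high rank, while a feature absent from the rule model scores 0.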
      <p>
        During the second step, a test probing the importance of variables was performed
by analyzing the influence of the variables used for model building on the prediction
quality. Six different machine learning algorithms were applied to build predictors
on the original set of features and on the selected features: Classification Tree (CT),
Random Forest (RF), the CN2 decision rules algorithm (CN2), Naive Bayes (NB), k-Nearest
Neighbors (kNN), and Support Vector Machine (SVM). During this step, the 10-fold cross-validation
paradigm was used. Ten well-known evaluation measures were computed for each
predictor: Classification Accuracy (CA), Sensitivity, Specificity, Area Under the ROC curve
(AUC), Information Score (IS), F1 score (F1), Precision, Brier score, the Matthews
Correlation Coefficient (MCC), and finally the Informedness (Inform.) ratio [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ].
      </p>
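      <p>The evaluation step can be sketched as follows (assuming scikit-learn is available; the data, the classifier subset, and the selected columns are synthetic stand-ins, not the study's datasets or results):</p>

```python
# 10-fold cross-validation of three of the listed classifiers, comparing
# accuracy on all features versus on a selected subset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # only features 0 and 1 are relevant
selected = [0, 1]                          # stand-in for the FS result

results = {}
for name, clf in [("CT", DecisionTreeClassifier(random_state=0)),
                  ("NB", GaussianNB()),
                  ("kNN", KNeighborsClassifier())]:
    acc_all = cross_val_score(clf, X, y, cv=10).mean()
    acc_sel = cross_val_score(clf, X[:, selected], y, cv=10).mean()
    results[name] = (acc_all, acc_sel)
    print(name, round(acc_all, 3), round(acc_sel, 3))
```

On data of this kind, dropping the six noise columns typically leaves accuracy unchanged or slightly improved, which is the pattern the comparison in Step 3 looks for.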
    </sec>
    <sec id="sec-3">
      <title>3 Investigated Datasets</title>
      <p>
        Our initial investigations focused on applying the developed algorithm to several
real-world datasets. Five datasets were used during the experiments. Four of them were
gathered from the UCI ML repository, while the fifth set was developed earlier by the
authors [
authors [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ]. A summary of the datasets is presented in Table 1. These datasets have diverse
numbers of objects, features (and feature types), and classes.
To illustrate the proposed methodology, only results for the Breast cancer dataset are
presented in detail. The first step of the experiment revealed six features that were
recommended as important by all or almost all ranking measures. In Table 2, we can
observe that the deg-malig, node-caps, irradiat, inv-nodes, breast, and menopause features
form a stable, core set of features with the highest values of the seven measures
of importance, particularly for the RQualityFS measure introduced in our investigation.
In the same table, a comparison with the importance of contrast variables (italic rows and
the contrast index) is also presented. The most important contrast feature is tumor-size
(contrast), for which the RQualityFS measure, defined earlier, equals 2.34. We treated
this value as a threshold that separates the core, relevant set of attributes from other, less
informative attributes. For most of the measures (except SVM weight) used in this approach,
the selected set of features has values higher than the corresponding
threshold value (underlined values); these values are denoted in bold in
Table 2. Hereby, we can observe that different measures give different thresholds.
      </p>
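      <p>The thresholding idea described above can be sketched in a few lines (only the 2.34 score of tumor-size (contrast) comes from the text; every other value below is an illustrative stand-in, not an actual measurement):</p>

```python
# A real feature is retained only if its rank exceeds the best rank achieved
# by any contrast feature; ranks for the Breast cancer features are invented
# here solely to demonstrate the selection rule.
ranks = {"deg-malig": 6.10, "node-caps": 4.80, "irradiat": 3.90,
         "inv-nodes": 3.40, "breast": 2.90, "menopause": 2.60,
         "tumor-size": 2.10, "age": 1.70}
contrast_ranks = {"tumor-size (contrast)": 2.34, "age (contrast)": 1.12}

threshold = max(contrast_ranks.values())                  # 2.34
selected = [f for f, r in ranks.items() if r > threshold]
print(selected)
# ['deg-malig', 'node-caps', 'irradiat', 'inv-nodes', 'breast', 'menopause']
```

Because each ranking measure yields its own scores, this rule produces a (possibly different) threshold per measure, which matches the observation above that different measures give different thresholds.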
      <p>The second step of the experiment was devoted to evaluating the prediction
quality of the machine learning algorithms described in Section 2. During this step,
six different algorithms were applied using the 10-fold cross-validation method. The
average results for the Breast cancer dataset are shown in Figure 1. This procedure was
applied to two specified feature sets:</p>
      <sec id="sec-3-1">
        <title>Evaluation measures for the Breast cancer dataset using all original features</title>
        <p>CA Sens Spec AUC IS F1 Prec Brier MCC Inform.
0.75 0.59 0.59 0.70 0.08 0.58 0.79 0.37 0.32
0.82 0.79 0.93 0.94 1.32 0.81 0.84 0.27 0.75
0.76 0.70 0.90 0.92 1.09 0.72 0.78 0.34 0.64
0.65 0.56 0.80 0.78 0.69 0.59 0.64 0.50 0.42
0.63 0.54 0.79 0.77 0.52 0.62 0.63 0.51 0.43</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation measures for the Breast cancer dataset using only the selected features</title>
        <p>CA Sens Spec AUC IS F1 Prec Brier MCC Inform.
0.73 0.66 0.66 0.69 0.16 0.66 0.67 0.43 0.33
0.81 0.82 0.93 0.94 1.40 0.82 0.81 0.29 0.75
0.77 0.75 0.91 0.92 1.26 0.75 0.76 0.34 0.67
0.64 0.58 0.80 0.79 0.78 0.61 0.59 0.51 0.39
0.62 0.55 0.79 0.76 0.65 0.58 0.57 0.54 0.36</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          2.
          <string-name><surname>Bermingham</surname>, <given-names>M.L.</given-names></string-name>,
          <string-name><surname>Pong-Wong</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Spiliopoulou</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Hayward</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Rudan</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Campbell</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Wright</surname>, <given-names>A.F.</given-names></string-name>,
          <string-name><surname>Wilson</surname>, <given-names>J.F.</given-names></string-name>,
          <string-name><surname>Agakov</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Navarro</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Haley</surname>, <given-names>C.S.</given-names></string-name>:
          <article-title>Application of high-dimensional feature selection: evaluation for genomic prediction in man</article-title>.
          <source>Sci. Rep.</source>
          <volume>5</volume> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          3.
          <string-name><surname>Phuong</surname>, <given-names>T.M.</given-names></string-name>,
          <string-name><surname>Lin</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Altman</surname>, <given-names>R.B.</given-names></string-name>:
          <article-title>Choosing SNPs using feature selection</article-title>.
          <source>Proceedings - 2005 IEEE Computational Systems Bioinformatics Conference, CSB 2005</source>.
          pp. <fpage>301</fpage>-<lpage>309</lpage> (<year>2005</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dash</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wrapper-filter feature selection algorithm using a memetic framework</article-title>
          .
          <source>IEEE Trans. Syst. Man, Cybern. Part B Cybern</source>
          .
          <volume>37</volume>
          ,
          <fpage>70</fpage>
          -
          <lpage>76</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          5.
          <string-name><surname>Nilsson</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Peña</surname>, <given-names>J.M.</given-names></string-name>,
          <string-name><surname>Björkegren</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Tegnér</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Detecting multivariate differentially expressed genes</article-title>.
          <source>BMC Bioinformatics</source>
          <volume>8</volume>, <issue>150</issue> (<year>2007</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          6.
          <string-name><surname>Rudnicki</surname>, <given-names>W.R.</given-names></string-name>,
          <string-name><surname>Wrzesień</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Paja</surname>, <given-names>W.</given-names></string-name>:
          <article-title>All Relevant Feature Selection Methods and Applications</article-title>.
          In: Stańczyk, U., Jain, L.C. (eds.)
          <source>Feature Selection for Data and Pattern Recognition</source>.
          pp. <fpage>11</fpage>-<lpage>28</lpage>. Springer-Verlag Berlin Heidelberg, Berlin (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          7.
          <string-name>
            <surname>Stoppiglia</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dreyfus</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubois</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oussar</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Ranking a Random Feature for Variable and Feature Selection</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>1399</fpage>
          -
          <lpage>1414</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tuv</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borisov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torkkola</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Feature Selection Using Ensemble Based Ranking Against Artificial Contrasts</article-title>. <source>International Symposium on Neural Networks</source>
          . pp.
          <fpage>2181</fpage>
          -
          <lpage>2186</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fawcett</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An introduction to ROC analysis</article-title>
          .
          <source>Pattern Recognit. Lett</source>
          .
          <volume>27</volume>
          ,
          <fpage>861</fpage>
          -
          <lpage>874</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hippe</surname>
            ,
            <given-names>Z.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bajcar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blajdo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grzymala-Busse</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grzymala-Busse</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knap</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paja</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wrzesien</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Diagnosing Skin Melanoma: Current versus Future Directions</article-title>
          .
          <source>TASK Q</source>
          .
          <volume>7</volume>
          ,
          <fpage>289</fpage>
          -
          <lpage>293</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hernandez-Orallo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A unified view of performance metrics: translating threshold choice into expected classification loss</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>13</volume>
          ,
          <fpage>2813</fpage>
          -
          <lpage>2869</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>