<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Impact of filter feature selection on classification: an empirical study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Uchechukwu F. NJOKU</string-name>
          <email>unjoku@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besim BILALLI</string-name>
          <email>bbilalli@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto ABELLÓ</string-name>
          <email>aabello@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca BONTEMPI</string-name>
          <email>gianluca.bontempi@ulb.be</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université Libre de Bruxelles</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>The high-dimensionality of Big Data poses challenges in data understanding and visualization. Furthermore, it leads to lengthy model building times in data analysis and poor generalization for machine learning models. Consequently, there is a need for feature selection, which allows identifying the more relevant part of the data to improve the data analysis (e.g., building simpler and more understandable models with reduced training time and improved model performance). This study aims to (i) characterize the factors (i.e., dataset characteristics) that influence the performance of feature selection methods, and (ii) assess the impact of feature selection on the training time and accuracy of binary and multiclass classification problems. As a result, we propose a systematic method to select representative datasets (i.e., considering the distributions of several dataset characteristics) in a given repository. Next, we provide an empirical study of the impact of eight feature selection methods on Naive Bayes (NB), Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP) classification algorithms using 32 real-world datasets and a relative performance measure. We observed that feature selection is more effective in reducing training time (e.g., up to 60% for LDA classifiers) than improving classification accuracy (e.g., up to 5%). Furthermore, we observed that feature selection gives a slight accuracy improvement for binary classification (i.e., up to 5%), while it mostly leads to accuracy degradation for multiclass classification. Although none of the studied feature selection methods is best in all cases, for multiclass classification, we observed that correlation-based and minimum redundancy maximum relevance feature selection methods gave the best results in accuracy. Through statistical testing, we found LDA and MLP to benefit more in accuracy improvement after feature selection than KNN and NB.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Nowadays, organizations store data without proper problem
definition, hoping to gain some insight from the gathered data in
the future. This ‘blind’ data collection leads to datasets that have
a large number of features, often irrelevant or redundant [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
raises challenges in data storage, understanding, visualization,
model training time, and explainability. In addition, there is often
an insufficient number of samples (compared to the
number of features) to build a model of sufficiently good quality
[
        <xref ref-type="bibr" rid="ref5">5</xref>
]. This is the case in fields like bioinformatics, where
microarray data can contain thousands of gene expressions as features
but only a few tens or hundreds of samples.
      </p>
      <p>
To tackle these challenges, effective techniques to reduce the
dimensionality of datasets are necessary. A common technique
for this is Feature Selection (FS). FS reduces the dimensionality
of a dataset by selecting the subset of its features that
are most relevant according to a defined criterion. FS methods can
be classified, by their reliance on a learning algorithm to select
the most relevant features, into filter, wrapper, and embedded FS
methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
<p>Filter methods evaluate and select features independently of
any learning algorithm, relying solely on the characteristics of
the dataset to determine the relevance of a feature. This approach
returns a subset of features that is not biased towards any learning
algorithm and is thus highly generalizable, though not optimal for
any particular learning algorithm.</p>
<p>Wrapper methods select features through an
iterative process of choosing a subset of features and evaluating its
quality by the performance (e.g., accuracy) of a model built with
a particular learning algorithm. The cycle continues until it
arrives at a subset of features that optimizes the model performance
or a pre-set halting condition is satisfied. Wrapper methods select a
subset of features with the aim of optimizing the performance of
the chosen learning algorithm, hence they lack generalizability. Also,
their iterative approach makes them time-inefficient.</p>
<p>Embedded methods refer to learning algorithms that have FS
encapsulated. Embedded methods are optimal and time-efficient
because they use the target learning algorithm in the selection of
features, although not iteratively. Examples of embedded methods
are tree-based learning algorithms such as decision trees, which
prune irrelevant features while building the model using criteria
like Entropy or Gini (defined by Equations 1 and 9 respectively).</p>
<p>An FS method is affected by various dataset characteristics, such as
its number of features or instances. Therefore, it is imperative to
understand the dataset characteristics that affect the performance
of each FS method. Furthermore, depending on the data
analysis task at hand, FS methods may impact the training time and
accuracy of a model. Thus, the impact could differ for
classification (binary, multiclass), clustering, and regression tasks. Besides
impacting various tasks differently, their effect could differ per
learning algorithm. Hence, a context-specific understanding of
the impact of FS is essential.</p>
      <p>In this work, we focus on binary and multiclass classification
tasks and study the impact of eight filter FS methods on four
classification algorithms. To select representative datasets from
the OpenML repository (https://www.openml.org) for the experiments, we use clustering
discretization to find two clusters each for three dataset
characteristics, namely the number of features, the number of instances, and
the class balance, while the number of classes fits by default into two
clusters (binary and multiclass). This leads to 16 possible
combinations of the dataset characteristics, and for each combination,
we choose two representative datasets.</p>
<p>In particular, we make the following contributions:
(1) We identify the factors that influence the runtime of each
FS method.
(2) We propose a systematic method to select representative
datasets from a given repository.
(3) We perform an extensive set of experiments, with
varying sizes of datasets, and show the scalability of the FS
methods.
(4) We study the effect of FS on the accuracy and training
time of binary and multiclass classification models and, for
both model types, we statistically determine the significant
factors responsible for the observed effects and highlight
the differences in the outcomes of both model types.
(5) We perform an extensive set of experiments with various
feature subset sizes, and investigate how these impact the
observed changes in the accuracy and training time of the
classifiers after FS.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        The objective of FS is to build simpler and more understandable
models, improve model performance, reduce their training time,
and prepare data that are clean, understandable, and easier to
visualize [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. Due to its relevance in Big Data, researchers have
focused on FS, especially in recent decades. An essential aspect of
this ongoing research has been investigating the efficacy of the
many proposed FS methods through benchmarks and
comparisons. Many field-specific FS benchmarks have been conducted
in which the datasets used are all related to a particular subject.
For example, software defect data was used in [
        <xref ref-type="bibr" rid="ref32">32</xref>
],
object-based imagery data in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], biomedical data in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], cancer data
in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], intrusion detection data in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], and text categorization
datasets in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This re-emphasizes the usefulness of FS across
domains. However, there is an intrinsic relationship between
dataset characteristics and the performance of FS methods [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
Therefore, dataset homogeneity calls into question the robustness of
these benchmarks, as there is little variation in the datasets'
characteristics. For example, biomedical datasets often
have a large number of features and a relatively
small number of instances; a benchmark using only biomedical
datasets could be biased towards this profile.
      </p>
      <p>
The efficacy of an FS method is usually evaluated using a
learning algorithm's performance. The choice of learning
algorithms to use depends on the intended analysis task. By this, we
mean tasks like binary or multiclass classification, regression,
clustering, or any other data analysis task. For classification tasks,
past benchmarks [
        <xref ref-type="bibr" rid="ref10 ref22 ref32 ref4">4, 10, 22, 32</xref>
] mainly focused on binary
classification, which is generally considered easier. Some field-specific
benchmarks [
        <xref ref-type="bibr" rid="ref16 ref18 ref2 ref21">2, 16, 18, 21</xref>
        ] have also focused on multiclass classification.
However, there is a lack of comparative studies highlighting the
distinct impact of FS on both analysis tasks.
      </p>
      <p>
        Using learning algorithms to evaluate and compare FS
methods means comparing the performance of models built with the
features selected by each FS method. The most commonly used
metric is the model accuracy [
        <xref ref-type="bibr" rid="ref16 ref21 ref22 ref4">4, 16, 21, 22</xref>
        ]. Some others include:
ROC [
        <xref ref-type="bibr" rid="ref19 ref2">2, 19</xref>
        ], AUC [
        <xref ref-type="bibr" rid="ref10 ref2 ref32">2, 10, 32</xref>
        ], F-Score [
        <xref ref-type="bibr" rid="ref19 ref2">2, 19</xref>
        ], Precision [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Recall
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Kappa index [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. Indeed, one can compare the
performance of various FS methods using any of these metrics.
However, these metrics alone do not take into account the relative
impact (i.e., in comparison to the models built without FS). The work in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
benchmarks 23 models: 22 built using the features selected by 22
FS methods, and one built without FS. Using the mean accuracy
(from 10-fold cross-validation), the authors compare the FS methods by
counting the number of datasets (out of 16) on which the mean
accuracy of each model is higher than the others'.
      </p>
      <p>
In this work, we first propose a systematic method to find
representative datasets in a given repository, in order to remedy the
problem of 'non-representative' or 'homogeneous' datasets. Next,
using this method, we select 32 datasets from the OpenML [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
repository and benchmark eight filter FS methods with a relative
measure, calculated as the change in training time (Equation 16)
and accuracy (Equation 17) after FS. We use four classification algorithms
in binary and multiclass classification tasks. Finally, we validate
the significance of our results using statistical tests.
      </p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND</title>
<p>In the following, we provide the necessary definitions of the
concepts appearing throughout this work.</p>
    </sec>
    <sec id="sec-4">
      <title>Preliminaries</title>
<p>Entropy quantifies the uncertainty of a discrete random variable $X$ with sample space $\mathcal{X} = \{1, 2, \ldots, n\}$ and is defined as:
$$H(X) = -\sum_{x \in \mathcal{X}} P(x) \log(P(x)), \quad (1)$$
where $x \in \mathcal{X}$ and $P(x)$ is the prior probability of $x$ over all possible outcomes of $X$, i.e., $\mathcal{X}$.</p>
      <p>Conditional entropy measures the remaining uncertainty of $X$ given another discrete variable $Y$ with sample space $\mathcal{Y} = \{1, 2, \ldots, m\}$ and is defined as:
$$H(X|Y) = -\sum_{y \in \mathcal{Y}} P(y) \sum_{x \in \mathcal{X}} P(x|y) \log(P(x|y)), \quad (2)$$
where $P(x|y)$ is the conditional probability of $x$ given $y$.</p>
      <p>Mutual information is the measure of information shared between two discrete random variables $X$, $Y$ and is defined as:
$$I(X; Y) = H(X) - H(X|Y). \quad (3)$$</p>
      <p>Conditional mutual information quantifies the amount of information shared between two discrete random variables $X$, $Y$ when a third discrete random variable $Z$ is known. It is defined as:
$$I(X; Y|Z) = H(X|Z) - H(X|Y, Z). \quad (4)$$</p>
      <p>Bayes' theorem states that given two events $A$ and $B$, with $P(B) \neq 0$, then:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}. \quad (5)$$</p>
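      <p>As a concrete illustration (our sketch, not from the paper; it assumes discrete-valued NumPy arrays), these quantities can be estimated from empirical frequencies:</p>
      <preformat>
import numpy as np

def entropy(x):
    """Empirical entropy H(X) of a discrete array (Equation 1)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(x, y):
    """H(X|Y) (Equation 2): expected entropy of X within each value of Y."""
    vals, counts = np.unique(y, return_counts=True)
    p_y = counts / counts.sum()
    return sum(p * entropy(x[y == v]) for v, p in zip(vals, p_y))

def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y) (Equation 3)."""
    return entropy(x) - conditional_entropy(x, y)
      </preformat>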
    </sec>
    <sec id="sec-5">
      <title>Classification Algorithms</title>
<p>The classification task is a common problem in practice and
is categorized into (i) binary classification, where subjects are
assigned to one of two classes, for instance when a bank wants to
know if a loan applicant is likely to default or not before issuing a
loan, and (ii) multiclass classification, where subjects are assigned
to one of more than two classes, as in classifying the image of a
plant into one of many species.</p>
      <sec id="sec-5-1">
        <title>Conditional mutual information quantifies the amount of</title>
        <p>information shared between two discrete random variables  , 
when a third discrete random variable  is known. It is defined
as:</p>
        <p>( ;  | ) =  ( | ) −  ( | ,  ).</p>
        <p>Bayes’ theorem states that given two events  and , with
 () ≠ 0, then:
 (|) =
 ( |) () .</p>
        <p>()</p>
        <p>In this work, four classification algorithms that do not have FS
embedded, and that are applicable to both binary and multiclass
problems were used. Their details are discussed below.</p>
        <p>
Naive Bayes (NB) is a collection of classification algorithms
that are based on Bayes' theorem and the assumption that every
pair of features $X_i$, $X_j$ in a dataset is conditionally independent
given the class feature $Y$, i.e.:
$$P(X_i \cap X_j \mid Y) = P(X_i \mid Y)\,P(X_j \mid Y). \quad (6)$$
NB classifies an instance to the most probable class value $y$ given
its feature vector $\langle x_1, x_2, \ldots, x_n \rangle$ and has a time complexity of
$O(n \cdot m)$ [
          <xref ref-type="bibr" rid="ref11">11</xref>
] for a dataset with $n$ features and $m$ instances.
        </p>
        <p>
K-Nearest Neighbours (KNN) is a commonly used
classification algorithm that works on the 'birds of a feather
flock together' principle. The class of an unknown instance is
determined by that of its $k$ closest neighbors whose classes are
known. KNN is considered a lazy algorithm because it does not
build a classifier once for reuse; rather, it scans through the
training data to determine the class of an unknown instance each time
it is called [
          <xref ref-type="bibr" rid="ref20">20</xref>
], and has a time complexity of $O(n \cdot m)$ [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
Linear Discriminant Analysis (LDA) is a classic
classification algorithm that uses linear decision boundaries to build a
classifier for unknown instances. Its closed-form solution (a closed-form
solution is any formula that can be evaluated in a finite number of
standard operations), with no need for hyperparameter tuning, makes it
easy to compute, and it has demonstrated good performance in practice [
          <xref ref-type="bibr" rid="ref23">23</xref>
]. Given
an unknown instance $x$, the probability of $x$ belonging to a class
$k$ of the target feature $Y$ with $K$ classes is estimated by:
$$P(Y = k \mid X = x) = \frac{P(k)\,P(x \mid k)}{\sum_{l=1}^{K} P(l)\,P(x \mid l)}. \quad (7)$$
        </p>
        <p>
However, $P(x \mid k)$ can be estimated with a Gaussian distribution
function which, when substituted, leads to a simplified
discriminant function for class $k$ given instance $x$, defined as:
$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k)), \quad (8)$$
where $\mu_k$ is the mean value of $x$ for the class $k$ and $\sigma^2$ is the
variance across all inputs $x$ [
          <xref ref-type="bibr" rid="ref6">6</xref>
]. LDA works under the assumption
that the data is normally distributed and that the variances of all
features are equal. It can also be used for dimensionality reduction,
by projecting the training data onto a linear subspace such that the
separability between classes is maximized [
          <xref ref-type="bibr" rid="ref23">23</xref>
]. LDA has a
time complexity of $O(n \cdot m \cdot t + t^3)$ [
          <xref ref-type="bibr" rid="ref7">7</xref>
], where $t = \min(n, m)$.
        </p>
<p>Multilayer Perceptron (MLP) is a type of feedforward
artificial neural network consisting of three types of layers of nodes: an
input layer, an output layer (with one node per class of the target
feature), and an arbitrary number of hidden
layers. A full pass of the data (an epoch) through the MLP comprises
a forward and a backward pass. In the forward pass, data is passed
from the input layer through the hidden layer(s) to the output
layer, adjusting the weights of the nodes based on the ground
truth. The backward pass goes in the opposite direction, from the
output layer to the input layer, to minimize the error of the MLP.
The classification model is built by several epochs of the training
data through the MLP, after which it is used to classify unknown
instances.</p>
        <p>
The time complexity of MLP, $O(n \cdot m \cdot h^k \cdot o \cdot i)$ [
          <xref ref-type="bibr" rid="ref1">1</xref>
], depends on
its architecture and configuration, i.e., the number of iterations $i$,
output neurons $o$, and hidden layers $k$, each with $h$ neurons. MLP
is applied in various fields including speech and image
recognition, and natural language processing [
          <xref ref-type="bibr" rid="ref30">30</xref>
]. This classifier has the
option of regularization, which can be considered embedded FS;
however, this option was disabled for this work.
        </p>
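        <p>For reference, a plausible scikit-learn instantiation of the four classifiers (our sketch; the paper states only that default hyper-parameters were used and that MLP regularization was disabled) is:</p>
        <preformat>
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

# Default hyper-parameters throughout; for MLP, the L2 penalty is set
# to zero because regularization can act as embedded feature selection.
classifiers = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "MLP": MLPClassifier(alpha=0.0),
}
        </preformat>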
    </sec>
    <sec id="sec-6">
      <title>Feature Selection Algorithms</title>
<p>The generalizability of filter methods allows for an easier
comparison of their impact on multiple learning algorithms, since the
resulting subset of features is derived independently of any
learning algorithm (unlike with wrapper and embedded FS methods). Also,
filter methods are less time-consuming than wrapper methods
because they do not require iterative training of the learning
algorithm to select a subset of the dataset's features. For these reasons,
eight filter methods were studied in this work; they are presented
below.</p>
      <p>
        Gini index is a very common statistical measure used to
quantify the ability of a feature to separate instances between classes
[
        <xref ref-type="bibr" rid="ref17">17</xref>
]. A feature $f$ with $v$ distinct values is scored as:
$$GI(f) = \min_{w} \Bigg[ P(f_w)\Big(1 - \sum_{c=1}^{C} P(Y_c \mid f_w)^2\Big) + P(\bar{f}_w)\Big(1 - \sum_{c=1}^{C} P(Y_c \mid \bar{f}_w)^2\Big) \Bigg], \quad (9)$$
where $Y_c$ is the $c$th ($c \leq C$) class, and $f_w$ and $\bar{f}_w$
denote the instances with $f$ value $\leq w$ and $&gt; w$ respectively, where
$w$ is one of the $v$ values of $f$; this implies that $w$ separates the dataset
into $f_w$ and $\bar{f}_w$. The maximum Gini score in binary
classification is 0.5, and lower values imply higher relevance.</p>
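      <p>A direct, unoptimized implementation of this score (our sketch, assuming a discrete feature array and class labels) could look as follows:</p>
      <preformat>
import numpy as np

def gini_index(f, y):
    """Gini score of feature f against labels y (Equation 9).
    Lower values imply higher relevance."""
    scores = []
    for w in np.unique(f)[:-1]:              # candidate split points
        score = 0.0
        for mask in (np.less_equal(f, w), np.greater(f, w)):
            _, counts = np.unique(y[mask], return_counts=True)
            p_class = counts / counts.sum()
            score += mask.mean() * (1 - np.sum(p_class ** 2))
        scores.append(score)
    return min(scores, default=1.0)          # default: constant feature
      </preformat>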
      <p>
ReliefF is a similarity-based method which extends the classic
Relief method from binary to multiclass
classification [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The idea of this method is to select features that
maximally distinguish between instances that are near to each
other [
        <xref ref-type="bibr" rid="ref25">25</xref>
]. The score of a feature $f$ is defined as:
$$Score(f) = \frac{1}{K}\sum_{k=1}^{K}\Bigg(-\frac{1}{|NH(x_k)|}\sum_{x_j \in NH(x_k)} d\big(X(k, f) - X(j, f)\big) + \sum_{y \neq y_{x_k}} \frac{1}{|NM(x_k, y)|}\,\frac{P(y)}{1 - P(y_{x_k})} \sum_{x_j \in NM(x_k, y)} d\big(X(k, f) - X(j, f)\big)\Bigg), \quad (10)$$
where $K$ is the number of randomly selected instances ($\leq m$),
$NH(x_k)$ is the set of instances closest to instance $x_k$ and
in the same class, $NM(x_k, y)$ is the set of instances
closest to $x_k$ and in class $y$, $P(y)$ is the probability of instances
belonging to class $y$, and $d(\cdot)$ is the distance estimator between
two instances.</p>
      <sec id="sec-6-1">
        <title>Spectral Feature Selection (SPEC) is a similarity-based</title>
        <p>
          method that works for supervised and unsupervised tasks [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
SPEC selects features from a dataset $X$ by the following three
steps [
          <xref ref-type="bibr" rid="ref33">33</xref>
]: (i) it builds a similarity set $S$ from $X$ using a similarity
measure, e.g., the Radial Basis Function (RBF) kernel, (ii) it extracts
a representative graph $G$ from $S$, and (iii) lastly, the structural
information of $G$ obtained from its spectrum (i.e., its eigenvalues
and their multiplicities) determines the relevance of each feature.
        </p>
        <p>
          Conditional Mutual Information Maximization (CMIM)
is an information theoretical-based method which iteratively
picks features that maximize mutual information to the class
feature conditioned on the already selected features [
          <xref ref-type="bibr" rid="ref29">29</xref>
]. Where
$S$ is the subset of already selected features, $Y$ the class feature, and
$f$ an unselected feature, the feature score of $f$ can be defined as:
$$Score_{CMIM}(f) = \min_{f_s \in S} I(f; Y \mid f_s). \quad (11)$$
        </p>
        <p>This approach ensures that the selected features are good
predictors of the class label but minimally redundant.</p>
        <p>
          Minimum Redundancy Maximum Relevance (mRMR) is
also an information-theoretic method, proposed to select
features that balance relevance and redundancy. A feature $f$ is
scored as:
$$Score_{mRMR}(f) = I(f; Y) - \frac{1}{|S|}\sum_{f_s \in S} I(f; f_s), \quad (12)$$
where the mutual information between the feature and the
target feature, $I(f; Y)$, is the measure of relevance, while
$\frac{1}{|S|}\sum_{f_s \in S} I(f; f_s)$ measures the feature's redundancy with respect to the
already selected features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Joint Mutual Information (JMI) is an information-theoretic
method which iteratively selects features that
complement the already selected features with respect to the class
feature. Its score is defined as:
$$Score_{JMI}(f) = \sum_{f_s \in S} I(f; Y \mid f_s). \quad (13)$$</p>
      <sec id="sec-6-2">
        <title>Eficient and Robust Feature Selection (RFS) is a sparse</title>
        <p>
          learning based method that uses a joint 2,1-norm minimization
on both the loss function and the regularization. This approach
is more robust to noise and achieves group feature sparsity [
          <xref ref-type="bibr" rid="ref17">17</xref>
]. The objective function for RFS is:
$$\min_{W} \; \|XW - Y\|_{2,1} + \gamma \|W\|_{2,1}, \quad (14)$$
where $W$, $X$, and $Y$ are the matrix of feature weights, the matrix
of feature values, and the one-hot encoding of the target feature,
respectively, and $\gamma$ is the parameter of the regularization term
$\|W\|_{2,1}$.</p>
        <p>
          Correlation-based Feature Selection (CFS) is a statistical
based method with the basic idea of selecting a feature subset
 in which features are highly correlated with the class and
uncorrelated with other selected features [
          <xref ref-type="bibr" rid="ref13">13</xref>
]. This implies that
redundant features are less likely to be selected. The CFS score
of a feature subset $S$, dubbed its "merit", is defined as:
$$Merit(S) = \frac{|S|\;\overline{r_{cf}}}{\sqrt{|S| + |S|(|S| - 1)\;\overline{r_{ff}}}}, \quad (15)$$
where $\overline{r_{cf}}$ is the average feature-class correlation and $\overline{r_{ff}}$ is the
average feature-feature inter-correlation.</p>
        <p>
CFS begins with an empty set of features, computes the
symmetric uncertainty of each feature, and then greedily adds features
to the set using a forward best-first search strategy [
          <xref ref-type="bibr" rid="ref13">13</xref>
]. It
stops after five consecutive searches without improvement in merit.
        </p>
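      <p>To illustrate how such scores translate into a selection procedure, the following sketch (ours, not the paper's implementation; a naive greedy mRMR over discretized features) applies Equation 12 using scikit-learn's mutual_info_score:</p>
      <preformat>
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Greedily select k features by the mRMR score (Equation 12).
    X: (m, n) array of discrete feature values; y: (m,) class labels."""
    n = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, f], y) for f in range(n)])
    selected = [int(np.argmax(relevance))]   # start with the most relevant
    while len(selected) != k:
        rest = [f for f in range(n) if f not in selected]
        scores = [relevance[f]
                  - np.mean([mutual_info_score(X[:, f], X[:, s]) for s in selected])
                  for f in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
      </preformat>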
    </sec>
    <sec id="sec-7">
      <title>METHODOLOGY</title>
      <p>To evaluate the impact of the presented FS methods on
classification, we followed a systematic approach to define the
components of the experiments carried out in this work and to select
the datasets used. In what follows, we present the details.</p>
      <p>[Figure 1: The system under performance evaluation. The baseline flow builds a classifier on the full dataset; the target flow first applies an FS method to produce a filtered dataset.]</p>
      <p>
The number of features to select, the parameters of the FS methods and
classification algorithms, and the dataset characteristics culminate
in more factors than we can control in this work. Therefore, it
is essential to clearly define the components of the experiments,
as failing to do so can lead to mistakes (such as an
unrepresentative workload, erroneous analysis, or an overly complex analysis). We
followed the systematic approach for performance evaluation
proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to avoid these common mistakes.
      </p>
      <p>In what follows, we present the details of the ten steps in the
systematic approach for performance evaluation.</p>
      <p>(1) Goals and System Definition</p>
      <p>In this work, the system under performance evaluation
consists of the datasets (full and filtered), FS methods, and
classification algorithms as shown in Figure 1. Within this
system, we define two subsystems: (i) the target which
represents the flow ⟨Full Dataset → FS Method →
Filtered Dataset → Classification Algorithm ⟩, where FS is
performed before model building and (ii) the baseline
which represents the path ⟨Full Dataset → Classification
Algorithms⟩ where the full dataset, without FS, is used to
build the model.</p>
      <p>The experiments we execute benchmark the target against
the baseline subsystem for all the combinations of dataset,
FS method, and classification algorithm in this work.
The goals of the experiments conducted are to:
• Measure the FS runtime for each dataset.
• Measure the model training time of the classifiers built
in the baseline and target subsystems.
• Measure the performance of the classifiers built in the
baseline and target subsystems.</p>
      <p>(2) Services and Outcomes</p>
      <p>The FS methods and learning algorithms deliver the
services of the system. Both processes take a dataset as input;
the former outputs a subset of the datasets’ features while
the latter produces a classifier model.</p>
      <p>(3) Metrics</p>
      <p>Metrics are the criteria used to evaluate the performance
of a system. The metrics used to evaluate the system under
performance evaluation (Figure 1) are:
• FS runtime, which is the time taken to obtain the
feature subset/ranked features of a dataset.
• Model Training Time (TT) change, which is the ratio of the
difference in training time between the baseline and
target classifiers to the baseline classifier's training time
(a decrease in runtime is expected, i.e., Baseline TT &gt; Target TT):
$$TT\ change = \frac{Baseline\ TT - Target\ TT}{Baseline\ TT}. \quad (16)$$
• Classifier accuracy change, which is the ratio of the
difference between the balanced accuracy of the target
classifier and that of the baseline classifier to the balanced accuracy
of the baseline classifier (an increase in accuracy is expected, i.e.,
Target Accuracy &gt; Baseline Accuracy):
$$Accuracy\ change = \frac{Target\ Accuracy - Baseline\ Accuracy}{Baseline\ Accuracy}. \quad (17)$$</p>
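      <p>These two relative measures are straightforward to compute; a minimal sketch (ours):</p>
      <preformat>
def tt_change(baseline_tt, target_tt):
    """Equation 16: relative training-time reduction (positive is better)."""
    return (baseline_tt - target_tt) / baseline_tt

def accuracy_change(baseline_acc, target_acc):
    """Equation 17: relative balanced-accuracy change (positive is better)."""
    return (target_acc - baseline_acc) / baseline_acc
      </preformat>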
      <p>(4) Parameters</p>
      <p>Several characteristics of the system components (datasets
and algorithms) affect the performance of the system.
These are called parameters and are categorized into
workload and system parameters. The workload parameters are
associated with the dataset and algorithms and include:</p>
      <p>Dataset parameters
• Number of features ($n$)
• Number of instances ($m$)
• Number of classes ($c$)
• Dimensionality ($D$), which is the feature-instance ratio of a
dataset, defined as:
$$D = \frac{n}{m}. \quad (18)$$
• Class balance ($CB$), which is the distribution of instances
amongst the classes in the target variable. It is measured
by the normalized Shannon entropy, defined as:
$$CB = -\sum_{i=1}^{c} \frac{\frac{m_i}{m} \log\big(\frac{m_i}{m}\big)}{\log c}, \quad (19)$$
where there are $m_i$ instances in class $i$. A perfectly
balanced dataset, with an equal number of instances in each
class, has a class balance of one, while an extremely
unbalanced dataset has a class balance of zero.
• Class entropy, which measures the entropy (Equation 1)
of the classes in the target feature.
• Average feature correlation ($\rho$), which measures the
interdependence between all features and is quantified by:
$$\rho = \frac{1}{t}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} |\rho_{ij}|, \quad (20)$$
where $\rho_{ij}$ is the Pearson correlation coefficient of
features $i$ and $j$, and $t$ is the total number of $\rho_{ij}$'s summed.
A $\rho$ value of zero indicates independence of all features,
while a value of one indicates high correlation between
features and thus feature redundancy [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].</p>
      <p>Algorithm parameters
• Feature subset size ($fss$), which is the number of features
to be selected; it affects the results derived from the system.
• Hyper-parameters, which are the several parameters of the
FS and classification algorithms that can be set to
differing values to control the model building process [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. These parameters affect the output of the algorithms
and thus the system.</p>
      <p>Lastly, the system parameters are the characteristics
of the hardware used to execute the experiments and include,
among others:
• Processor speed
• RAM/Disk size
• Operating system context switching overhead</p>
      <p>(5) Factors to Study</p>
      <p>
The number of parameters controlled in an experiment
determines its workload and completion time.
Considering the exponential growth of the experimental design
space with an increasing number of parameters, we
considered five parameters (the number of features, instances,
and classes, the class balance, and the feature subset size) as the most
important, because they control the time complexity of
the FS methods, as shown in Table 4, as well as
classification accuracy [
        <xref ref-type="bibr" rid="ref22">22</xref>
]. Also, some other workload parameters
are derived from the selected factors or are known not to
influence the results of FS methods; for instance,
dimensionality is simply the feature-instance ratio, and most FS
methods do not factor in the average correlation of
features during FS but rather assess features individually.
Hence, dimensionality and average feature correlation are
not included in the study.
      </p>
      <p>Factors such as processor speed or storage size,
associated with the hardware on which the experiments were
executed, were constant, as one machine was used for all
executions. However, to mitigate the effects of context
switches, especially on execution time, we employed 10-fold
cross-validation to ensure accurate results. Lastly, the
default hyper-parameters of the classification and FS
algorithms were used, since the same setting was applied
to both baseline and target classifiers. Controlling all
parameters for all algorithms would not only explode the
design space but also shift our focus to hyper-parameter
optimization, which is outside the scope of this work.</p>
      <p>(6) Evaluation Technique</p>
      <p>The experiments were programmed in Python, using the
scikit-feature (https://github.com/jundongl/scikit-feature) repository and the
scikit-learn (https://scikit-learn.org/stable) and time (https://docs.python.org/3)
libraries, which implement the studied FS and learning
algorithms and allow us to measure execution times.</p>
      <p>(7) Workload</p>
      <p>The datasets used in the evaluation of the system
constitute its workload. Considering four dataset characteristics
with two clusters each (see Section 4.2), we systematically
selected 32 real-world datasets from the OpenML
repository for this work. We considered only datasets with
no missing values as candidates, to save time in the
preprocessing stage and to avoid the potential effect of missing
values on the classifiers' performance (especially in cases
where instances with missing values are dropped). Lastly, only datasets with
numeric values were considered, due to the input
constraint of the FS methods studied.</p>
      <p>(8) Experiment Design</p>
      <p>Out of the eight FS methods studied, the Gini, RFS, ReliefF,
SPEC, and CFS methods return a ranking of the features
in order of relevance, while JMI, MRMR, and CMIM
return a subset of the features according to the specified
size. Five possible subset sizes were studied, specifically a
feature subset size of $i \cdot n$ (rounded) for $i \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$.
This means that a total of 20 runs (Table 1) were required for the
eight FS methods and two classifiers (baseline and target)
for each of the four classification algorithms. Therefore,
with all 32 datasets, a full factorial experimental design
with 5,120 (32*20*8) experiments was used.</p>
      <sec id="sec-7-1">
        <title>7https://github.com/jundongl/scikit-feature 8https://scikit-learn.org/stable 9https://docs.python.org/3</title>
        <p>with all 32 datasets, a full factorial experimental design
with 5,120 (32*20*8) experiments will be used.
(9) Data Analysis and Interpretation</p>
        <p>To derive insights from the results of the experiments,
the Z-test, Friedman, and Nemenyi statistical methods were
used to analyze our results (i.e., the measured metrics). For
each of these tests, a null and an alternative hypothesis were
formulated, and based on the test result, we either reject
or fail to reject the null hypothesis.</p>
        <p>(10) Results</p>
        <p>Lastly, conclusions from the analysis of the results of the
experiments are presented and discussed in Section 5.</p>
        <p>Several data repositories provide open data that enables
research; OpenML was used in this work. Considering the input
constraints of some of the algorithms studied, not all datasets in
OpenML (over 21,000) could be used.</p>
        <p>The available datasets were pruned based on the following
eligibility criteria:
• Number of features ≤ 1,000
• Number of instances ≤ 10,000
• No missing values
• Numerical types (excluding timestamps)
• Single target variable
After pruning, there were 380 candidate datasets available and
the metadata (number of features, instances, classes, and the
class balance) of these datasets showed a wide range of values
for the factors to be studied; for example, the number of features
of the datasets ranged from 2 to 971. Also, we noticed repeated or
highly similar datasets. Therefore, we selected a subset that properly
represents the available datasets while reducing redundancy.</p>
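        <p>For reference, this pruning could be scripted with the openml Python package (our sketch; the column names follow OpenML's dataset-quality metadata and should be treated as assumptions to verify against the current API):</p>
        <preformat>
import openml

# List all OpenML datasets together with their metadata qualities.
df = openml.datasets.list_datasets(output_format="dataframe")

# Apply the eligibility criteria from the text, one filter at a time.
candidates = df[df["NumberOfFeatures"].le(1000)]
candidates = candidates[candidates["NumberOfInstances"].le(10000)]
candidates = candidates[candidates["NumberOfMissingValues"].eq(0)]
# numeric features only (the single target variable aside)
candidates = candidates[candidates["NumberOfSymbolicFeatures"].le(1)]
print(len(candidates))
        </preformat>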
        <p>Using the clustering discretization method, which takes into
account the intrinsic distribution of data values, each of the
factors in the metadata was discretized into two bins. This was
done both for the metadata values and for the logarithm of the values,
as shown in Tables 2 and 3 respectively. It is indeed possible to
discretize into more than two bins; however, this leads to an
exponential increase in the full factorial of the design space. With
two bins and four dataset factors, we had $2^4 = 16$ possible factor
combinations, all of which need representative datasets. However, with
three bins, only nine out of the $3^4 = 81$ combinations had representative
datasets. Increasing the number of bins would therefore result in
more factor combinations with no representative datasets.</p>
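        <p>A plausible implementation of this clustering discretization (our sketch; the paper does not name a library) is scikit-learn's KBinsDiscretizer with the 'kmeans' strategy applied to the log-scaled metadata:</p>
        <preformat>
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# meta: one row per candidate dataset with the number of features,
# number of instances, and class balance (toy values for illustration).
meta = np.array([[971.0, 5000, 0.9],
                 [2.0, 150, 1.0],
                 [60.0, 9000, 0.3]])

# Two bins per factor; bin edges found by 1-D k-means on the log values.
disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
bins = disc.fit_transform(np.log(meta))
# Together with the binary/multiclass split, each dataset then maps to
# one of the 2^4 = 16 factor combinations.
print(bins)
        </preformat>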
        <p>The discretization results from applying clustering to the
log values were used to guide the selection of the datasets from
OpenML, because they best describe the distribution of the datasets
in the repository and contained candidate datasets for all 16
combinations of the studied factors.</p>
        <p>After the discretization, two datasets were selected randomly
for each of the 16 factor combinations, for a total of 32 datasets.</p>
        <p>[Tables 2 and 3: Two discretization bins per factor (number of instances, number of features, class balance), computed on the raw and log-scaled metadata values respectively.]</p>
        <p>
To begin, all features of the dataset were discretized using the
uniform discretization method, setting the number of bins to
$\max\{\min\{\frac{m}{3}, 10\}, 2\}$, where $m$ is the number of instances in the
dataset [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Discretizing the features is done to meet the input
constraint of most FS methods which work with only discrete
features.
        </p>
        <p>Next, the dataset was split into ten partitions; nine partitions
formed the training set, while one partition was used as the test
set. After this split, the entire training set was used for FS as
well as for building the baseline classifier, while the training data
containing only the selected subset of features was used to build
the target classifier. Both classifiers were tested on the test set
and their accuracy was measured. The FS, model building, and
testing were repeated ten times for each dataset (i.e., 10-fold
cross-validation); each time, a new partition is used as the test data
while the remaining nine are used as the training data. Figure 2
gives an overview of the described execution, from discretization
to measure recording. At the end of each 10-fold
execution, the median runtime and average balanced accuracy were
recorded. The median was used for the runtime for robustness
against possible variations due to context switching, and, since
we use a different training set for each fold, there might be slight
differences in the accuracy, hence the average serves as a global
accuracy.</p>
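        <p>The per-dataset procedure can be sketched as follows (our simplified sketch; select_features is a generic stand-in for any of the eight FS methods):</p>
        <preformat>
import time
import numpy as np
from sklearn.base import clone
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

def run_experiment(X, y, clf, select_features, k):
    """10-fold CV comparing a baseline classifier (all features) with a
    target classifier built on k features chosen by select_features."""
    base_t, tgt_t, base_a, tgt_a = [], [], [], []
    for train, test in StratifiedKFold(n_splits=10).split(X, y):
        cols = select_features(X[train], y[train], k)  # FS on training data only
        for X_tr, X_te, times, accs in (
            (X[train], X[test], base_t, base_a),
            (X[train][:, cols], X[test][:, cols], tgt_t, tgt_a),
        ):
            model = clone(clf)
            start = time.perf_counter()
            model.fit(X_tr, y[train])
            times.append(time.perf_counter() - start)
            accs.append(balanced_accuracy_score(y[test], model.predict(X_te)))
    # Median runtime (robust to context switches), mean balanced accuracy.
    return (np.median(base_t), np.median(tgt_t),
            np.mean(base_a), np.mean(tgt_a))
        </preformat>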
        <p>Due to the poor stability of some FS methods, different
features might be selected when the data samples change. Hence,
FS and model building were executed together in each iteration
to ensure that the performance evaluation is done on the same
data. The Python source code and dataset information for the
experiments are publicly available on GitHub
(https://github.com/F-U-Njoku/filter-fs-impact-on-classification).</p>
    </sec>
    <sec id="sec-8">
      <title>RESULTS</title>
      <p>There is an expectation for improved model runtime and accuracy
after FS. However, FS itself could be time expensive leading to
an overhead that causes the total time needed for FS and model
building to increase beyond the time it takes to build the model
without FS.</p>
      <p>[Figure 2: Overview of the execution pipeline, from discretization through the 10-fold cross-validation split, feature selection (Gini, ReliefF, SPEC, CMIM, MRMR, JMI, RFS, CFS), and classifier training (KNN, LDA, MLP, NB), to the recording of training time and balanced accuracy.]</p>
      <p>In this section, we present and discuss the results
of our experiments on the eight FS methods and four classification
algorithms, using 32 datasets with the feature subset size set to $0.5n$,
$0.6n$, $0.7n$, $0.8n$, and $0.9n$, where $n$ is the number of features.</p>
      <p>The runtime of an FS method is affected by various factors and in
differing proportions. Some methods are influenced more by the number
of features, others by the number of instances, and so on. The
time complexities of the studied FS methods are presented in Table 4,
which shows the factors that affect the runtime of each method.</p>
      <p>Gini has the lowest runtime and is influenced mainly by the
feature and class size. ReliefF and SPEC are controlled by the
instance and feature size, with the former having a quadratic
impact on the runtime. CMIM, MRMR, and JMI all have the
same time complexity, with the feature and subset size being the
factors that predominantly influence their runtime. Lastly, RFS
and CFS have the longest runtimes, due to the numerous matrix
multiplication operations they require.</p>
      <p>[Figure 3: FS runtime (in seconds) of the studied methods.]</p>
      <p>CFS was the most time-consuming method across
all datasets; this is due to the multiple computations of pairwise
correlations between features required by this method. Second
to CFS was RFS, which took an average of 14 minutes. The other
methods took less than 50 seconds on average, while Gini was
the most efficient, taking less than a second on average.</p>
      <p>Although FS is time-expensive for some methods, the
advantage of filter methods is their generalizability: the selected
features can be used repeatedly across various learning algorithms
for model selection, since they were obtained independently of
any learning algorithm.</p>
      <p>The dataset size, particularly the feature and instance sizes,
controls the time it takes to build a classifier, as discussed in Section 3.
For learning algorithms that build classifiers during training, such
as NB, LDA, and MLP, FS reduces the model training time
as expected. For LDA, averaged over all datasets
(i.e., both binary and multiclass classification), we recorded over 50% training
time reduction when the feature subset size was set to $0.5n$,
and over 13% for $0.9n$, as shown in Figure 4a. In this case,
as expected, selecting fewer features led to faster model building.</p>
      <p>On the other hand, a lazy learning algorithm like KNN does
not build a classifier at the point of training; rather, it stores useful
metadata that is used at the point of classification. Hence, KNN
does not benefit from FS in the expected manner. As shown in
Figure 4b, KNN follows a pattern different from the other algorithms
and gives the best training time improvement with the subset size set to $0.8n$
in most cases. Therefore, the benefit of reduced model runtime
after FS depends on the learning algorithm.</p>
      <p>Accuracy is one of the metrics used to measure the performance
of a model; due to the presence of unbalanced datasets in the
workload, we specifically use the balanced accuracy.</p>
      <p>Since one of the objectives of FS is to build models with
improved performance that generalize the data better, the
expectation is an improvement in model accuracy after FS.
However, as shown in Figure 6, the change in accuracy after FS
differs for binary and multiclass classification. For binary
classification, there was an average improvement in accuracy after
FS for all methods except SPEC. In contrast, reducing the
number of features generally led to an accuracy decrease for multiclass
classification. Although SPEC is applicable to both supervised
and unsupervised learning tasks, it gave the worst performance
for both binary and multiclass classification, as shown in Figures
5 and 6; this is because it performs FS independently of the target
variable, hence the features selected by SPEC tend to be less
relevant compared to those selected by the other methods.</p>
      <p>As also shown in Figure 5, CFS yielded good results in most
cases; this is because it uses the correlation of features with the
target feature as well as the pairwise correlation of features for
FS. This is useful because it ensures minimal redundancy in the
selected features. For multiclass classification, CFS and MRMR
gave the best results (i.e., the least accuracy degradation).
Therefore, although FS generally results in a decrease in multiclass
classification accuracy, in case the dataset is too large
and FS is indeed needed, the CFS and MRMR methods are suggested.</p>
      <p>Since the improvement in accuracy is different for binary and
multiclass classification, we analyze the change in accuracy for
both classification types separately, with the aim of identifying
the factors that most influence the accuracy change.</p>
      <p>5.3.1 Statistical Analysis.</p>
      <p>Through visualization, we observe a difference in the accuracy
change obtained for binary and multiclass classification (Figure
5), and this is irrespective of the FS method used (Figure 6). We
further our investigation by statistically testing the significance of this
observation, for the subset size set to $0.7n$ and an alpha
level of 0.05.</p>
      <p>We begin by formulating the following hypotheses:
• $H_0$: The classification type does not influence the change
in accuracy derived from applying FS.
• $H_1$: The classification type influences the change in
accuracy derived from applying FS.</p>
      <p>To test the aforementioned hypotheses, we consider two
independent groups, the binary and multiclass groups, each made up
of the accuracy changes between baseline and target classifiers after
FS on 16 datasets (512 result points). Testing the previous
hypotheses using the Z-test, the results show that the impact of FS on
binary classification in terms of accuracy change ($\mu = 0.0058$, $\sigma =
0.0506$, $n = 508$) was significantly higher than on multiclass
classification ($\mu = -0.0627$, $\sigma = 0.1483$, $n = 512$), $z = 9.2589$,
$p &lt; 0.001$. Therefore, the classification type indeed influences
the change in accuracy derived from applying FS.</p>
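      <p>A two-sample Z-test over the reported group statistics can be reproduced as follows (our sketch; the exact z value may differ slightly depending on the test variant used):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

# Reported summary statistics (mean, standard deviation, sample size).
mu_b, sd_b, n_b = 0.0058, 0.0506, 508    # binary classification
mu_m, sd_m, n_m = -0.0627, 0.1483, 512   # multiclass classification

z = (mu_b - mu_m) / np.sqrt(sd_b**2 / n_b + sd_m**2 / n_m)
p = 2 * norm.sf(abs(z))                  # two-sided p-value
print(z, p)
      </preformat>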
      <p>
Besides the type of classification, the various classification
algorithms could interact with the FS methods to give different
results in terms of accuracy change. Hence, using Friedman's
test, we evaluate the accuracy changes for binary and multiclass
classification separately. The Friedman ($F_F$) test [
        <xref ref-type="bibr" rid="ref8">8</xref>
] ranks the accuracy
change of the four classifiers for each dataset-FS method pair, with
the best rank being 1 and the worst rank 4. In the case of
ties, the average rank is used. We test the null hypothesis that all
algorithms benefit equally from FS; therefore, there should be an
equal ranking of the derived accuracy change for each dataset
and FS method pair.
      </p>
      <p>For binary classification, using 16 datasets (for the CFS
method, the largest dataset was left out, as it did not complete
within reasonable time), eight FS methods,
and four classification algorithms, the $F_F$-statistic is distributed
according to the $F$ distribution with $(4 - 1) = 3$ and $(4 - 1)(127 -
1) = 378$ degrees of freedom. The critical value of $F(3, 378)$ is
2.6285, and the derived $F$-statistic is 2.4741, giving $p = 0.0612$.
Therefore, we fail to reject the null hypothesis and conclude that
the studied classification algorithms gain similarly from FS for
binary classification.</p>
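      <p>The reported F-statistics are consistent with the Friedman test followed by the Iman-Davenport correction described in [8]; a sketch of this computation (ours, with random toy data) using SciPy:</p>
      <preformat>
import numpy as np
from scipy.stats import friedmanchisquare, f

# data: accuracy changes, one row per (dataset, FS method) pair and one
# column per classifier (NB, KNN, LDA, MLP); random toy values here.
rng = np.random.default_rng(0)
data = rng.normal(size=(127, 4))

chi2, _ = friedmanchisquare(*data.T)        # Friedman chi-square statistic
N, k = data.shape
F = (N - 1) * chi2 / (N * (k - 1) - chi2)   # Iman-Davenport F correction
p = f.sf(F, k - 1, (k - 1) * (N - 1))       # compare against F(3, 378)
print(F, p)
      </preformat>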
      <p>For multiclass classification, using 16 datasets, eight FS
methods, and four classification algorithms, the $F_F$-statistic is distributed
according to the $F$ distribution with $(4 - 1) = 3$ and $(4 - 1)(128 - 1) =
381$ degrees of freedom. The critical value of $F(3, 381)$ is 2.6283, and
the derived $F$-statistic is 11.4198, giving $p &lt; 0.001$. Therefore,
we reject the null hypothesis and proceed to a post-hoc test.
We use the Nemenyi test with critical value 2.569 to compare the
classifiers to each other, which yields a Critical Difference (CD)
of 0.4145. Using the CD, we identify two groups of algorithms,
as shown in Figure 7: LDA and MLP benefit more in terms of
accuracy improvement after FS than KNN and NB.
Therefore, for multiclass classification, the impact of FS on accuracy is
affected by the choice of algorithm.</p>
      <p>From the results of our experiments, presented previously and
in the supplementary material
(https://github.com/F-U-Njoku/filter-fs-impact-on-classification), it is clear that there is no FS
method that is best for all cases, i.e., for all datasets, tasks, and
learning algorithms. Choosing a feature selection method is, therefore,
use-case dependent. The first step is to define the analysis task, for
which we considered binary and multiclass classification. Feature
selection behaves differently for these tasks, particularly in terms
of how it impacts model performance and execution time.</p>
      <p>For multiclass classification, feature selection does not
improve the model performance (i.e., accuracy); on the contrary, it
degrades it, and the smaller the feature subset size, the worse the
decline in performance. For this reason, our analysis suggests
that feature selection should be avoided for multiclass
classification. Yet, in case it is necessary (e.g., for performance or training time
reasons), then either CFS (for small-sized datasets) or MRMR (for
large-sized datasets) can be used, because they gave the least
accuracy degradation, and improvement in a few cases, compared
to the other studied methods.</p>
      <p>In the case of binary classification, we observed improvements
in the classification accuracy, which depend on the chosen
classification algorithm. For all five feature subset sizes that we set, CFS
and MRMR increased the model accuracy for the NB classifier.
For the KNN classifier, CFS, ReliefF, Gini, and CMIM improved
accuracy, in that order. For the LDA classifier, CFS, ReliefF, Gini,
and CMIM improved accuracy, in that order, and lastly, for the MLP
classifier, CFS led to an improvement in accuracy. It is tempting
to assume that CFS is the best choice to make when the goal
is to improve performance for binary classification. However,
we must recall from Figure 3 that this method is
computationally expensive and is unsuitable for large datasets. Hence, these
recommendations must be considered taking into account the
dataset size and the FS methods' complexity, as presented in Section 5.1.</p>
      <p>Finally, whether for binary or multiclass classification, Gini is
a time-efficient method, and especially for very large datasets,
we propose this feature selection method to improve runtime.
Figure 8 summarizes our recommendations based on the results
of our experiments, given the goal, task, and classification
algorithm.</p>
      <p>[Figure 8: Recommended FS methods per goal (runtime vs. accuracy), task (binary vs. multiclass), and classification algorithm.]</p>
    </sec>
    <sec id="sec-9">
      <title>6 CONCLUSIONS AND FUTURE WORK</title>
      <p>In this work, we studied the impact of eight filter FS methods on
four classification algorithms using 32 real-world datasets. We
found that, on average, FS improves the accuracy of binary
classification; however, for multiclass classification, feature selection
leads to a decline in model accuracy.</p>
      <p>With regard to FS runtime, the Gini FS method was the most
efficient, with an average runtime of less than a second, making
it the recommended FS method when minimizing runtime is
the objective. However, when the goal is to improve classifier
accuracy, none of the studied FS methods gives the best accuracy
improvement in all cases. Although FS for multiclass classification
on average led to a degradation in classifier accuracy, we
observed that the CFS and MRMR methods gave the best results, in
that they yielded the least degradation in accuracy and led to
improvement in some cases.</p>
      <p>As further work, we plan to extend this work to larger datasets,
with more clusters of dataset factors, using multiple repositories
to facilitate finding representative datasets. We also intend to
investigate the dependence of the multiclass classification performance
degradation after FS on the multiclass classification strategy, and
the KNN training time degradation with the reduction in the
number of features observed in some cases.</p>
      <p>Since FS applies to supervised and unsupervised tasks, this
work can be extended to regression and clustering tasks. Lastly,
several existing methods have polynomial runtimes and do not
scale efficiently; more efficient implementations of these methods,
aiming for linear or sub-linear time complexity, are highly needed,
especially with Big Data being ubiquitous.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>The project leading to this publication has received funding from
the European Commission under the European Union’s Horizon
2020 research and innovation programme (grant agreement No
955895).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hajar</given-names>
            <surname>Alla</surname>
          </string-name>
          , Lahcen Moumoun, and
          <string-name>
            <given-names>Youssef</given-names>
            <surname>Balouki</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A Multilayer Perceptron Neural Network with Selective-Data Training for Flight Arrival Delay Prediction</article-title>
          .
          <source>Scientific Programming</source>
          <year>2021</year>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yindalon</given-names>
            <surname>Aphinyanaphongs</surname>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence D Fu</surname>
            ,
            <given-names>Zhiguo</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            , Eric R Peskin, Efstratios Efstathiadis, Constantin F Aliferis,
            <given-names>and Alexander</given-names>
          </string-name>
          <string-name>
            <surname>Statnikov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>65</volume>
          ,
          <issue>10</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Verónica</given-names>
            <surname>Bolón-Canedo</surname>
          </string-name>
          ,
          <source>Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos</source>
          .
          <year>2013</year>
          .
          <article-title>A review of feature selection methods on synthetic data</article-title>
          .
          <source>Knowledge and information systems 34</source>
          , 3 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Bommert</surname>
          </string-name>
          , Xudong Sun, Bernd Bischl, Jörg Rahnenführer, and
          <string-name>
            <given-names>Michel</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Benchmark for filter methods for feature selection in high-dimensional classification data</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          <volume>143</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Gianluca</given-names>
            <surname>Bontempi</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Statistical foundations of machine learning (2nd edition) handbook</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Brownlee</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Master Machine Learning Algorithms: discover how they work and implement them from scratch</article-title>
          .
          <source>Machine Learning Mastery.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Deng</given-names>
            <surname>Cai</surname>
          </string-name>
          , Xiaofei He, and Jiawei Han.
          <year>2008</year>
          .
          <article-title>Training linear discriminant analysis in linear time</article-title>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Janez</given-names>
            <surname>Demšar</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Zhenyun</given-names>
            <surname>Deng</surname>
          </string-name>
          , Xiaoshu Zhu, Debo Cheng, Ming Zong, and
          <string-name>
            <given-names>Shichao</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Efficient kNN classification algorithm for big data</article-title>
          .
          <source>Neurocomputing</source>
          <volume>195</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Drotár</surname>
          </string-name>
          , Juraj Gazda, and
          <string-name>
            <given-names>Zdenek</given-names>
            <surname>Smékal</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>An experimental comparison of feature selection methods on two-class biomedical datasets</article-title>
          .
          <source>Computers in biology and medicine</source>
          <volume>66</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Charles</given-names>
            <surname>Elkan</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Naive Bayesian Learning</article-title>
          .
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>François</given-names>
            <surname>Fleuret</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Fast binary feature selection with conditional mutual information</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          ,
          <issue>9</issue>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Mark Andrew</given-names>
            <surname>Hall</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Correlation-based feature selection for machine learning</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . The University of Waikato (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Raj</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Igor</given-names>
            <surname>Kononenko</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Estimating attributes: Analysis and extensions of RELIEF</article-title>
          .
          <source>In European conference on machine learning</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Andrea S</given-names>
            <surname>Laliberte</surname>
          </string-name>
          , DM Browning, and
          <string-name>
            <given-names>Albert</given-names>
            <surname>Rango</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A comparison of three feature selection methods for object-based classification of sub-decimeter resolution UltraCam-L imagery</article-title>
          .
          <source>International Journal of Applied Earth Observation and Geoinformation</source>
          <volume>15</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jundong</given-names>
            <surname>Li</surname>
          </string-name>
          , Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu.
          <year>2017</year>
          .
          <article-title>Feature selection: A data perspective</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>50</volume>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Hsi-Che</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pei-Chen</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tzung-Chien</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ting-Chi</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chien-Yu</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jen-Yin</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lee-Yung</given-names>
            <surname>Shih</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Der-Cherng</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Comparison of Feature Selection Methods for Cross-Laboratory Microarray Analysis</article-title>
          .
          <source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
          <volume>10</volume>
          ,
          <issue>3</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Munirah</given-names>
            <surname>Mohd Yusof</surname>
          </string-name>
          , Rozlini Mohamed, and
          <string-name>
            <given-names>Noorhaniza</given-names>
            <surname>Wahid</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Benchmark of feature selection techniques with machine learning algorithms for cancer datasets</article-title>
          .
          <source>In Proceedings of the ICAIR-CACRE.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Mucherino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Petraq J.</given-names>
            <surname>Papajorgji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Panos M.</given-names>
            <surname>Pardalos</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>k-Nearest Neighbor Classification</article-title>
          . Springer New York.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Hai Thanh</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Slobodan Petrović, and
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Franke</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A comparison of feature-selection methods for intrusion detection</article-title>
          .
          <source>In International conference on mathematical methods, models, and architectures for computer network security</source>
          . Springer,
          <fpage>242</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Dijana</given-names>
            <surname>Oreski</surname>
          </string-name>
          , Stjepan Oreski, and
          <string-name>
            <given-names>Bozidar</given-names>
            <surname>Klicek</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Effects of dataset characteristics on the performance of feature selection techniques</article-title>
          .
          <source>Applied Soft Computing</source>
          <volume>52</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Hanchuan</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fuhui</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Ding</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>27</volume>
          ,
          <issue>8</issue>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Marko</given-names>
            <surname>Robnik-Šikonja</surname>
          </string-name>
          and Igor Kononenko.
          <year>2003</year>
          .
          <article-title>Theoretical and empirical analysis of ReliefF and RReliefF</article-title>
          .
          <source>Machine learning</source>
          <volume>53</volume>
          ,
          <issue>1</issue>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Ryan J.</given-names>
            <surname>Urbanowicz</surname>
          </string-name>
          , Melissa Meeker, William La Cava, Randal S. Olson, and
          <string-name>
            <given-names>Jason H.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Relief-based feature selection: Introduction and review</article-title>
          .
          <source>Journal of biomedical informatics</source>
          <volume>85</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Christiaan Maarten</given-names>
            <surname>Van der Walt</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Data measures that characterise classification problems</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . University of Pretoria.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan N.</given-names>
            <surname>van Rijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bernd</given-names>
            <surname>Bischl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luis</given-names>
            <surname>Torgo</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>OpenML: Networked Science in Machine Learning</article-title>
          .
          <source>SIGKDD Explorations</source>
          <volume>15</volume>
          ,
          <issue>2</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Michel</given-names>
            <surname>Vidal-Naquet</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shimon</given-names>
            <surname>Ullman</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Object Recognition with Informative Features and Linear Classification</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          , Vol.
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Philip D.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tom</given-names>
            <surname>Schwartz</surname>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>Neural networks. II. What are they and why is everybody so interested in them now?</article-title>
          <source>IEEE expert</source>
          <volume>3</volume>
          ,
          <issue>1</issue>
          (
          <year>1988</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Hilde JP</given-names>
            <surname>Weerts</surname>
          </string-name>
          , Andreas C Mueller,
          and
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Importance of tuning hyperparameters of machine learning algorithms</article-title>
          . arXiv preprint arXiv:2007.07588 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Zhou</given-names>
            <surname>Xu</surname>
          </string-name>
          , Jin Liu, Zijiang Yang,
          <string-name>
            <given-names>Gege</given-names>
            <surname>An</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiangyang</given-names>
            <surname>Jia</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Impact of Feature Selection on Defect Prediction Performance: An Empirical Comparison</article-title>
          .
          <source>In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE).</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Huan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Spectral feature selection for supervised and unsupervised learning</article-title>
          .
          <source>In Proceedings of the 24th international conference on Machine learning.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>