<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Impact of filter feature selection on classification: an empirical study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Uchechukwu F. NJOKU</string-name>
          <email>unjoku@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besim BILALLI</string-name>
          <email>bbilalli@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto ABELLÓ</string-name>
          <email>aabello@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca BONTEMPI</string-name>
          <email>gianluca.bontempi@ulb.be</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université Libre de Bruxelles</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>The high-dimensionality of Big Data poses challenges in data understanding and visualization. Furthermore, it leads to lengthy model building times in data analysis and poor generalization for machine learning models. Consequently, there is a need for feature selection, which allows identifying the more relevant part of the data to improve the data analysis (e.g., building simpler and more understandable models with reduced training time and improved model performance). This study aims to (i) characterize the factors (i.e., dataset characteristics) that influence the performance of feature selection methods, and (ii) assess the impact of feature selection on the training time and accuracy of binary and multiclass classification problems. As a result, we propose a systematic method to select representative datasets (i.e., considering the distributions of several dataset characteristics) in a given repository. Next, we provide an empirical study of the impact of eight feature selection methods on Naive Bayes (NB), Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP) classification algorithms using 32 real-world datasets and a relative performance measure. We observed that feature selection is more effective in reducing training time (e.g., up to 60% for LDA classifiers) than improving classification accuracy (e.g., up to 5%). Furthermore, we observed that feature selection gives a slight accuracy improvement for binary classification (i.e., up to 5%), while it mostly leads to accuracy degradation for multiclass classification. Although none of the studied feature selection methods is best in all cases, for multiclass classification, we observed that correlation-based and minimum redundancy maximum relevance feature selection methods gave the best results in accuracy. Through statistical testing, we found LDA and MLP to benefit more in accuracy improvement after feature selection than KNN and NB.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Nowadays, organizations store data without proper problem
definition, hoping to gain some insight from the gathered data in
the future. This ‘blind’ data collection leads to datasets that have
a large number of features, often irrelevant or redundant [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
raises challenges in data storage, understanding, visualization,
model training time, and explainability. In addition, there is often
an insufficient number of samples (compared to the
number of features) to build a model of sufficiently good quality
[
        <xref ref-type="bibr" rid="ref5">5</xref>
]. This is the case in fields like bioinformatics, where
microarray data can contain thousands of gene expressions as features
but only a few tens or hundreds of samples.
      </p>
      <p>
To tackle these challenges, effective techniques to reduce the
dimensionality of datasets are necessary. A common technique
for this is Feature Selection (FS). FS reduces the dimensionality
of a dataset by selecting the subset of its features that
are most relevant according to a defined criterion. FS methods can
be classified, by their reliance on a learning algorithm to select
the most relevant features, into filter, wrapper, and embedded FS
methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
<p>Filter methods evaluate and select features independently of
any learning algorithm, relying solely on the characteristics of
the dataset to determine the relevance of a feature. This approach
returns a subset of features that is not biased towards any learning
algorithm and is thus highly generalizable, though not optimal for
any particular learning algorithm.</p>
<p>Wrapper methods select features through an
iterative process of choosing a subset of features and evaluating its
quality by the performance (e.g., accuracy) of a model built with
a particular learning algorithm. The cycle continues until it
arrives at a subset of features that optimizes the model performance
or a pre-set halting condition is satisfied. Wrapper methods select a
subset of features with the aim of optimizing the performance of
the chosen learning algorithm, hence they lack generalizability. Also,
their iterative approach makes them time-inefficient.</p>
<p>Embedded methods refer to learning algorithms that have FS
encapsulated. Embedded methods are optimal and time-efficient
because they use the target learning algorithm in the selection of
features, although not iteratively. Examples of embedded methods
are tree-based learning algorithms such as decision trees, which
prune irrelevant features while building the model using criteria
like Entropy or Gini (defined by Equations 1 and 9 respectively).</p>
<p>An FS method is affected by various dataset characteristics, such as
its number of features or instances. Therefore, it is imperative to
understand the dataset characteristics that affect the performance
of each FS method. Furthermore, depending on the data
analysis task at hand, FS methods may impact the training time and
accuracy of a model. Thus, the impact could differ for
classification (binary, multiclass), clustering, and regression tasks. Besides
impacting various tasks differently, their effect could differ per
learning algorithm. Hence, a context-specific understanding of
the impact of FS is essential.</p>
      <p>In this work, we focus on binary and multiclass classification
tasks and study the impact of eight filter FS methods on four
classification algorithms. To select representative datasets from
the OpenML repository (https://www.openml.org) for the experiments, we use clustering
discretization to find two clusters each for three dataset
characteristics, namely the number of features, the number of instances, and
the class balance, while the number of classes fits by default into two
clusters (binary and multiclass). This leads to 16 possible
combinations of the dataset characteristics, and for each combination,
we choose two representative datasets.</p>
<p>In particular, we make the following contributions:
(1) We identify the factors that influence the runtime of each
FS method.
(2) We propose a systematic method to select representative
datasets from a given repository.
(3) We perform an extensive set of experiments, with
varying sizes of datasets, and show the scalability of the FS
methods.
(4) We study the effect of FS on the accuracy and training
time of binary and multiclass classification models and, for
both model types, we statistically determine the significant
factors responsible for the observed effects and highlight
the differences in the outcomes of both model types.
(5) We perform an extensive set of experiments with various
feature subset sizes, and investigate how these impact the
observed changes in the accuracy and training time of the
classifiers after FS.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        The objective of FS is to build simpler and more understandable
models, improve model performance, reduce their training time,
and prepare data that are clean, understandable, and easier to
visualize [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. Due to its relevance in Big Data, researchers have
focused on FS, especially in recent decades. An essential aspect of
this ongoing research has been investigating the efficacy of the
many proposed FS methods through benchmarks and
comparisons. Many field-specific FS benchmarks have been conducted
in which the datasets used are all related to a particular subject.
For example, software defect data was used in [
        <xref ref-type="bibr" rid="ref32">32</xref>
],
object-based imagery data in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], biomedical data in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], cancer data
in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], intrusion detection data in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], and text categorization
datasets in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This re-emphasizes the usefulness of FS across
domains. However, there is an intrinsic relationship between
dataset characteristics and the performance of FS methods [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
Therefore, dataset homogeneity calls into question the robustness of
these benchmarks, as there is little variation in the datasets'
characteristics. For example, biomedical datasets often
have a large number of features and a relatively
small number of instances; a benchmark using only biomedical
datasets could be biased towards this profile.
      </p>
      <p>
The efficacy of an FS method is usually evaluated using a
learning algorithm's performance. The choice of learning
algorithms to use depends on the intended analysis task. By this, we
mean tasks like binary or multiclass classification, regression,
clustering, or any other data analysis task. For classification tasks,
past benchmarks [
        <xref ref-type="bibr" rid="ref10 ref22 ref32 ref4">4, 10, 22, 32</xref>
] mainly focused on binary
classification, which is generally considered easier. Some field-specific
benchmarks [
        <xref ref-type="bibr" rid="ref16 ref18 ref2 ref21">2, 16, 18, 21</xref>
        ] have also focused on multiclass classification.
However, there is a lack of comparative studies highlighting the
distinct impact of FS on both analysis tasks.
      </p>
      <p>
        Using learning algorithms to evaluate and compare FS
methods means comparing the performance of models built with the
features selected by each FS method. The most commonly used
metric is the model accuracy [
        <xref ref-type="bibr" rid="ref16 ref21 ref22 ref4">4, 16, 21, 22</xref>
        ]. Some others include:
ROC [
        <xref ref-type="bibr" rid="ref19 ref2">2, 19</xref>
        ], AUC [
        <xref ref-type="bibr" rid="ref10 ref2 ref32">2, 10, 32</xref>
        ], F-Score [
        <xref ref-type="bibr" rid="ref19 ref2">2, 19</xref>
        ], Precision [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Recall
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Kappa index [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. Indeed, one can compare the
performance of various FS methods using any of these metrics.
However, these metrics alone do not take into account the relative
impact (i.e., in comparison to the models built without FS). The work in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
benchmarks 23 models: 22 built using the features selected by 22
FS methods, and one built without FS. Using the mean accuracy
(from 10-fold cross-validation), the authors compare the FS methods by
counting the number of datasets (out of 16) on which the mean
accuracy of each model is higher than the others'.
      </p>
      <p>
In this work, we first propose a systematic method to find
representative datasets in a given repository, in order to remedy the
problem of 'non-representative' or 'homogeneous' datasets. Next,
using this method, we select 32 datasets from the OpenML [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
repository and benchmark eight filter FS methods with a relative
measure, calculated as the change in training time (Equation 16)
and accuracy (Equation 17) after FS. We use four classification algorithms
in binary and multiclass classification tasks. Finally, we validate
the significance of our results using statistical tests.
      </p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND</title>
<p>In the following, we provide the necessary definitions of the
concepts appearing throughout this work.</p>
    </sec>
    <sec id="sec-4">
      <title>Preliminaries</title>
<p>Entropy quantifies the uncertainty of a discrete random variable $X$ with sample space $\mathcal{X} = \{1, 2, \ldots, n\}$ and is defined as:
$$H(X) = -\sum_{x \in \mathcal{X}} P(x) \log(P(x)), \quad (1)$$
where $x \in \mathcal{X}$ and $P(x)$ is the prior probability of $x$ over all possible outcomes of $X$, i.e., $\mathcal{X}$.</p>
      <p>Conditional entropy measures the remaining uncertainty of $X$ given another discrete variable $Y$ with sample space $\mathcal{Y} = \{1, 2, \ldots, m\}$ and is defined as:
$$H(X|Y) = -\sum_{y \in \mathcal{Y}} P(y) \sum_{x \in \mathcal{X}} P(x|y) \log(P(x|y)), \quad (2)$$
where $P(x|y)$ is the conditional probability of $x$ given $y$.</p>
      <p>Mutual information is the measure of information shared between two discrete random variables $X$, $Y$ and is defined as:
$$I(X; Y) = H(X) - H(X|Y). \quad (3)$$</p>
      <p>Conditional mutual information quantifies the amount of information shared between two discrete random variables $X$, $Y$ when a third discrete random variable $Z$ is known. It is defined as:
$$I(X; Y|Z) = H(X|Z) - H(X|Y, Z). \quad (4)$$</p>
      <p>Bayes' theorem states that given two events $A$ and $B$, with $P(B) \neq 0$, then:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}. \quad (5)$$</p>
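      <p>As a concrete illustration (our sketch, not from the paper; it assumes discrete-valued NumPy arrays), these quantities can be estimated from empirical frequencies:</p>
      <preformat>
import numpy as np

def entropy(x):
    """Empirical entropy H(X) of a discrete array (Equation 1)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(x, y):
    """H(X|Y) (Equation 2): expected entropy of X within each value of Y."""
    vals, counts = np.unique(y, return_counts=True)
    p_y = counts / counts.sum()
    return sum(p * entropy(x[y == v]) for v, p in zip(vals, p_y))

def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y) (Equation 3)."""
    return entropy(x) - conditional_entropy(x, y)
      </preformat>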
    </sec>
    <sec id="sec-5">
      <title>Classification Algorithms</title>
<p>The classification task is a common problem in practice and
is categorized into (i) binary classification, where subjects are
assigned to one of two classes, for instance when a bank wants to
know if a loan applicant is likely to default or not before issuing a
loan, and (ii) multiclass classification, where subjects are assigned
to one of more than two classes, as in classifying the image of a
plant into one of many species.</p>
      <sec id="sec-5-1">
        <title>Conditional mutual information quantifies the amount of</title>
        <p>information shared between two discrete random variables  , 
when a third discrete random variable  is known. It is defined
as:</p>
        <p>( ;  | ) =  ( | ) −  ( | ,  ).</p>
        <p>Bayes’ theorem states that given two events  and , with
 () ≠ 0, then:
 (|) =
 ( |) () .</p>
        <p>()</p>
        <p>In this work, four classification algorithms that do not have FS
embedded, and that are applicable to both binary and multiclass
problems were used. Their details are discussed below.</p>
        <p>
Naive Bayes (NB) is a collection of classification algorithms
that are based on Bayes' theorem and the assumption that every
pair of features $X_i$, $X_j$ in a dataset is conditionally independent
given the class feature $Y$, i.e.:
$$P(X_i \cap X_j \mid Y) = P(X_i \mid Y)\,P(X_j \mid Y). \quad (6)$$
NB classifies an instance to the most probable class value $y$ given
its feature vector $\langle x_1, x_2, \ldots, x_n \rangle$ and has a time complexity of
$O(n \cdot m)$ [
          <xref ref-type="bibr" rid="ref11">11</xref>
] for a dataset with $n$ features and $m$ instances.
        </p>
        <p>
K-Nearest Neighbours (KNN) is a commonly used
classification algorithm that works on the 'birds of a feather
flock together' principle. The class of an unknown instance is
determined by that of its $k$ closest neighbors whose classes are
known. KNN is considered a lazy algorithm because it does not
build a classifier once for reuse; rather, it scans through the
training data to determine the class of an unknown instance each time
it is called [
          <xref ref-type="bibr" rid="ref20">20</xref>
], and has a time complexity of $O(n \cdot m)$ [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
Linear Discriminant Analysis (LDA) is a classic
classification algorithm that uses linear decision boundaries to build a
classifier for unknown instances. Its closed-form solution (a closed-form
solution is any formula that can be evaluated in a finite number of
standard operations), with no need for hyperparameter tuning, makes it
easy to compute, and it has demonstrated good performance in practice [
          <xref ref-type="bibr" rid="ref23">23</xref>
]. Given
an unknown instance $x$, the probability of $x$ belonging to a class
$k$ of the target feature $Y$ with $K$ classes is estimated by:
$$P(Y = k \mid X = x) = \frac{P(k)\,P(x \mid k)}{\sum_{l=1}^{K} P(l)\,P(x \mid l)}. \quad (7)$$
        </p>
        <p>
However, $P(x \mid k)$ can be estimated with a Gaussian distribution
function which, when substituted, leads to a simplified
discriminant function for class $k$ given instance $x$, defined as:
$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k)), \quad (8)$$
where $\mu_k$ is the mean value of $x$ for the class $k$ and $\sigma^2$ is the
variance across all inputs $x$ [
          <xref ref-type="bibr" rid="ref6">6</xref>
]. LDA works under the assumption
that the data is normally distributed and that the variances of all
features are equal. It can also be used for dimensionality reduction,
by projecting the training data onto a linear subspace such that the
separability between classes is maximized [
          <xref ref-type="bibr" rid="ref23">23</xref>
]. LDA has a
time complexity of $O(n \cdot m \cdot t + t^3)$ [
          <xref ref-type="bibr" rid="ref7">7</xref>
], where $t = \min(n, m)$.
        </p>
<p>Multilayer Perceptron (MLP) is a type of feedforward
artificial neural network consisting of three types of layers of nodes: an
input layer, an output layer (with one node per class of the target
feature), and an arbitrary number of hidden
layers. A full pass of the data (an epoch) through the MLP comprises
a forward and a backward pass. In the forward pass, data is passed
from the input layer through the hidden layer(s) to the output
layer, adjusting the weights of the nodes based on the ground
truth. The backward pass goes in the opposite direction, from the
output layer to the input layer, to minimize the error of the MLP.
The classification model is built by several epochs of the training
data through the MLP, after which it is used to classify unknown
instances.</p>
        <p>
The time complexity of MLP, $O(n \cdot m \cdot h^k \cdot o \cdot i)$ [
          <xref ref-type="bibr" rid="ref1">1</xref>
], depends on
its architecture and configuration, i.e., the number of iterations $i$,
output neurons $o$, and hidden layers $k$, each with $h$ neurons. MLP
is applied in various fields including speech and image
recognition, and natural language processing [
          <xref ref-type="bibr" rid="ref30">30</xref>
]. This classifier has the
option of regularization, which can be considered embedded FS;
however, this option was disabled for this work.
        </p>
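        <p>For reference, a plausible scikit-learn instantiation of the four classifiers (our sketch; the paper states only that default hyper-parameters were used and that MLP regularization was disabled) is:</p>
        <preformat>
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

# Default hyper-parameters throughout; for MLP, the L2 penalty is set
# to zero because regularization can act as embedded feature selection.
classifiers = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "MLP": MLPClassifier(alpha=0.0),
}
        </preformat>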
    </sec>
    <sec id="sec-6">
      <title>Feature Selection Algorithms</title>
<p>The generalizability of filter methods allows for an easier
comparison of their impact on multiple learning algorithms, since the
resulting subset of features is derived independently of any
learning algorithm (unlike with wrapper and embedded FS methods). Also,
filter methods are less time-consuming than wrapper methods
because they do not require iterative training of the learning
algorithm to select a subset of the dataset's features. For these reasons,
eight filter methods were studied in this work; they are presented
below.</p>
      <p>
        Gini index is a very common statistical measure used to
quantify the ability of a feature to separate instances between classes
[
        <xref ref-type="bibr" rid="ref17">17</xref>
]. A feature $f$ with $v$ distinct values is scored as:
$$GI(f) = \min_{w} \Bigg[ P(f_w)\Big(1 - \sum_{c=1}^{C} P(Y_c \mid f_w)^2\Big) + P(\bar{f}_w)\Big(1 - \sum_{c=1}^{C} P(Y_c \mid \bar{f}_w)^2\Big) \Bigg], \quad (9)$$
where $Y_c$ is the $c$th ($c \leq C$) class, and $f_w$ and $\bar{f}_w$
denote the instances with $f$ value $\leq w$ and $&gt; w$ respectively, where
$w$ is one of the $v$ values of $f$; this implies that $w$ separates the dataset
into $f_w$ and $\bar{f}_w$. The maximum Gini score in binary
classification is 0.5, and lower values imply higher relevance.</p>
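      <p>A direct, unoptimized implementation of this score (our sketch, assuming a discrete feature array and class labels) could look as follows:</p>
      <preformat>
import numpy as np

def gini_index(f, y):
    """Gini score of feature f against labels y (Equation 9).
    Lower values imply higher relevance."""
    scores = []
    for w in np.unique(f)[:-1]:              # candidate split points
        score = 0.0
        for mask in (np.less_equal(f, w), np.greater(f, w)):
            _, counts = np.unique(y[mask], return_counts=True)
            p_class = counts / counts.sum()
            score += mask.mean() * (1 - np.sum(p_class ** 2))
        scores.append(score)
    return min(scores, default=1.0)          # default: constant feature
      </preformat>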
      <p>
ReliefF is a similarity-based method which extends the classic
Relief method from binary to multiclass
classification [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The idea of this method is to select features that
maximally distinguish between instances that are near to each
other [
        <xref ref-type="bibr" rid="ref25">25</xref>
]. The score of a feature $f$ is defined as:
$$Score(f) = \frac{1}{K}\sum_{k=1}^{K}\Bigg(-\frac{1}{|NH(x_k)|}\sum_{x_j \in NH(x_k)} d\big(X(k, f) - X(j, f)\big) + \sum_{y \neq y_{x_k}} \frac{1}{|NM(x_k, y)|}\,\frac{P(y)}{1 - P(y_{x_k})} \sum_{x_j \in NM(x_k, y)} d\big(X(k, f) - X(j, f)\big)\Bigg), \quad (10)$$
where $K$ is the number of randomly selected instances ($\leq m$),
$NH(x_k)$ is the set of instances closest to instance $x_k$ and
in the same class, $NM(x_k, y)$ is the set of instances
closest to $x_k$ and in class $y$, $P(y)$ is the probability of instances
belonging to class $y$, and $d(\cdot)$ is the distance estimator between
two instances.</p>
      <sec id="sec-6-1">
        <title>Spectral Feature Selection (SPEC) is a similarity-based</title>
        <p>
          method that works for supervised and unsupervised tasks [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
SPEC selects features from a dataset $X$ by the following three
steps [
          <xref ref-type="bibr" rid="ref33">33</xref>
]: (i) it builds a similarity set $S$ from $X$ using a similarity
measure, e.g., the Radial Basis Function (RBF) kernel, (ii) it extracts
a representative graph $G$ from $S$, and (iii) lastly, the structural
information of $G$ obtained from its spectrum (i.e., its eigenvalues
and their multiplicities) determines the relevance of each feature.
        </p>
        <p>
          Conditional Mutual Information Maximization (CMIM)
is an information theoretical-based method which iteratively
picks features that maximize mutual information to the class
feature conditioned on the already selected features [
          <xref ref-type="bibr" rid="ref29">29</xref>
]. Where
$S$ is the subset of already selected features, $Y$ the class feature, and
$f$ an unselected feature, the feature score of $f$ can be defined as:
$$Score_{CMIM}(f) = \min_{f_s \in S} I(f; Y \mid f_s). \quad (11)$$
        </p>
        <p>This approach ensures that the selected features are good
predictors of the class label but minimally redundant.</p>
        <p>
          Minimum Redundancy Maximum Relevance (mRMR) is
also an information-theoretic method, proposed to select
features that balance relevance and redundancy. A feature $f$ is
scored as:
$$Score_{mRMR}(f) = I(f; Y) - \frac{1}{|S|}\sum_{f_s \in S} I(f; f_s), \quad (12)$$
where the mutual information between the feature and the
target feature, $I(f; Y)$, is the measure of relevance, while
$\frac{1}{|S|}\sum_{f_s \in S} I(f; f_s)$ measures the feature's redundancy with respect to the
already selected features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Joint Mutual Information (JMI) is an information-theoretic
method which iteratively selects features that
complement the already selected features with respect to the class
feature. Its score is defined as:
$$Score_{JMI}(f) = \sum_{f_s \in S} I(f; Y \mid f_s). \quad (13)$$</p>
      <sec id="sec-6-2">
        <title>Eficient and Robust Feature Selection (RFS) is a sparse</title>
        <p>
          learning based method that uses a joint 2,1-norm minimization
on both the loss function and the regularization. This approach
is more robust to noise and achieves group feature sparsity [
          <xref ref-type="bibr" rid="ref17">17</xref>
]. The objective function for RFS is:
$$\min_{W} \; \|XW - Y\|_{2,1} + \gamma \|W\|_{2,1}, \quad (14)$$
where $W$, $X$, and $Y$ are the matrix of feature weights, the matrix
of feature values, and the one-hot encoding of the target feature,
respectively, and $\gamma$ is the parameter of the regularization term
$\|W\|_{2,1}$.</p>
        <p>
          Correlation-based Feature Selection (CFS) is a statistical
based method with the basic idea of selecting a feature subset
 in which features are highly correlated with the class and
uncorrelated with other selected features [
          <xref ref-type="bibr" rid="ref13">13</xref>
]. This implies that
redundant features are less likely to be selected. The CFS score
of a feature subset $S$, dubbed its "merit", is defined as:
$$Merit(S) = \frac{|S|\;\overline{r_{cf}}}{\sqrt{|S| + |S|(|S| - 1)\;\overline{r_{ff}}}}, \quad (15)$$
where $\overline{r_{cf}}$ is the average feature-class correlation and $\overline{r_{ff}}$ is the
average feature-feature inter-correlation.</p>
        <p>
CFS begins with an empty set of features, computes the
symmetric uncertainty of each feature, and then greedily adds features
to the set using a forward best-first search strategy [
          <xref ref-type="bibr" rid="ref13">13</xref>
]. It
stops after five consecutive searches without improvement in merit.
        </p>
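      <p>To illustrate how such scores translate into a selection procedure, the following sketch (ours, not the paper's implementation; a naive greedy mRMR over discretized features) applies Equation 12 using scikit-learn's mutual_info_score:</p>
      <preformat>
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Greedily select k features by the mRMR score (Equation 12).
    X: (m, n) array of discrete feature values; y: (m,) class labels."""
    n = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, f], y) for f in range(n)])
    selected = [int(np.argmax(relevance))]   # start with the most relevant
    while len(selected) != k:
        rest = [f for f in range(n) if f not in selected]
        scores = [relevance[f]
                  - np.mean([mutual_info_score(X[:, f], X[:, s]) for s in selected])
                  for f in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
      </preformat>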
    </sec>
    <sec id="sec-7">
      <title>METHODOLOGY</title>
      <p>To evaluate the impact of the presented FS methods on
classification, we followed a systematic approach to define the
components of the experiments carried out in this work and to select
the datasets used. In what follows, we present the details.</p>
      <p>[Figure 1: The system under performance evaluation. The baseline flow builds a classifier on the full dataset; the target flow first applies an FS method to produce a filtered dataset.]</p>
      <p>
The number of features to select, the parameters of the FS methods and
classification algorithms, and the dataset characteristics culminate
in more factors than we can control in this work. Therefore, it
is essential to clearly define the components of the experiments,
as failing to do so can lead to mistakes (such as an
unrepresentative workload, erroneous analysis, or an overly complex analysis). We
followed the systematic approach for performance evaluation
proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to avoid these common mistakes.
      </p>
      <p>In what follows, we present the details of the ten steps in the
systematic approach for performance evaluation.</p>
      <p>(1) Goals and System Definition</p>
      <p>In this work, the system under performance evaluation
consists of the datasets (full and filtered), FS methods, and
classification algorithms as shown in Figure 1. Within this
system, we define two subsystems: (i) the target which
represents the flow ⟨Full Dataset → FS Method →
Filtered Dataset → Classification Algorithm ⟩, where FS is
performed before model building and (ii) the baseline
which represents the path ⟨Full Dataset → Classification
Algorithms⟩ where the full dataset, without FS, is used to
build the model.</p>
      <p>The experiments we execute benchmark the target against
the baseline subsystem for all the combinations of dataset,
FS method, and classification algorithm in this work.
The goals of the experiments conducted are to:
• Measure the FS runtime for each dataset.
• Measure the model training time of the classifiers built
in the baseline and target subsystems.
• Measure the performance of the classifiers built in the
baseline and target subsystems.</p>
      <p>(2) Services and Outcomes</p>
      <p>The FS methods and learning algorithms deliver the
services of the system. Both processes take a dataset as input;
the former outputs a subset of the datasets’ features while
the latter produces a classifier model.</p>
      <p>(3) Metrics</p>
      <p>Metrics are the criteria used to evaluate the performance
of a system. The metrics used to evaluate the system under
performance evaluation (Figure 1) are:
• FS runtime, which is the time taken to obtain the
feature subset/ranked features of a dataset.
• Model Training Time (TT) change, which is the ratio of the
difference in training time between the baseline and
target classifiers to the baseline classifier's training time
(a decrease in runtime is expected, i.e., Baseline TT &gt; Target TT):
$$TT\ change = \frac{Baseline\ TT - Target\ TT}{Baseline\ TT}. \quad (16)$$
• Classifier accuracy change, which is the ratio of the
difference between the balanced accuracy of the target
classifier and that of the baseline classifier to the balanced accuracy
of the baseline classifier (an increase in accuracy is expected, i.e.,
Target Accuracy &gt; Baseline Accuracy):
$$Accuracy\ change = \frac{Target\ Accuracy - Baseline\ Accuracy}{Baseline\ Accuracy}. \quad (17)$$</p>
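      <p>These two relative measures are straightforward to compute; a minimal sketch (ours):</p>
      <preformat>
def tt_change(baseline_tt, target_tt):
    """Equation 16: relative training-time reduction (positive is better)."""
    return (baseline_tt - target_tt) / baseline_tt

def accuracy_change(baseline_acc, target_acc):
    """Equation 17: relative balanced-accuracy change (positive is better)."""
    return (target_acc - baseline_acc) / baseline_acc
      </preformat>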
      <p>(4) Parameters</p>
      <p>Several characteristics of the system components (datasets
and algorithms) affect the performance of the system.
These are called parameters and are categorized into
workload and system parameters. The workload parameters are
associated with the dataset and algorithms and include:</p>
      <p>Dataset parameters
• Number of features ($n$)
• Number of instances ($m$)
• Number of classes ($c$)
• Dimensionality ($D$), which is the feature-instance ratio of a
dataset, defined as:
$$D = \frac{n}{m}. \quad (18)$$
• Class balance ($CB$), which is the distribution of instances
amongst the classes in the target variable. It is measured
by the normalized Shannon entropy, defined as:
$$CB = -\sum_{i=1}^{c} \frac{\frac{m_i}{m} \log\big(\frac{m_i}{m}\big)}{\log c}, \quad (19)$$
where there are $m_i$ instances in class $i$. A perfectly
balanced dataset, with an equal number of instances in each
class, has a class balance of one, while an extremely
unbalanced dataset has a class balance of zero.
• Class entropy, which measures the entropy (Equation 1)
of the classes in the target feature.
• Average feature correlation ($\rho$), which measures the
interdependence between all features and is quantified by:
$$\rho = \frac{1}{t}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} |\rho_{ij}|, \quad (20)$$
where $\rho_{ij}$ is the Pearson correlation coefficient of
features $i$ and $j$, and $t$ is the total number of $\rho_{ij}$'s summed.
A $\rho$ value of zero indicates independence of all features,
while a value of one indicates high correlation between
features and thus feature redundancy [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].</p>
      <p>Algorithm parameters
• Feature subset size ($fss$), which is the number of features
to be selected; it affects the results derived from the system.
• Hyper-parameters, which are the several parameters of the
FS and classification algorithms that can be set to
differing values to control the model building process [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. These parameters affect the output of the algorithms
and thus the system.</p>
      <p>Lastly, the system parameters are the characteristics
of the hardware used to execute the experiments and include,
among others:
• Processor speed
• RAM/Disk size
• Operating system context switching overhead</p>
      <p>(5) Factors to Study</p>
      <p>
The number of parameters controlled in an experiment
determines its workload and completion time.
Considering the exponential growth of the experimental design
space with an increasing number of parameters, we
considered five parameters (the number of features, instances,
and classes, the class balance, and the feature subset size) as the most
important, because they control the time complexity of
the FS methods, as shown in Table 4, as well as
classification accuracy [
        <xref ref-type="bibr" rid="ref22">22</xref>
]. Also, some other workload parameters
are derived from the selected factors or are known not to
influence the results of FS methods; for instance,
dimensionality is simply the feature-instance ratio, and most FS
methods do not factor in the average correlation of
features during FS but rather assess features individually.
Hence, dimensionality and average feature correlation are
not included in the study.
      </p>
      <p>Factors such as processor speed or storage size,
associated with the hardware on which the experiments were
executed, were constant, as one machine was used for all
executions. However, to mitigate the effects of context
switches, especially on execution time, we employed 10-fold
cross-validation to ensure accurate results. Lastly, the
default hyper-parameters of the classification and FS
algorithms were used, since the same setting was applied
to both baseline and target classifiers. Controlling all
parameters for all algorithms would not only explode the
design space but also shift our focus to hyper-parameter
optimization, which is outside the scope of this work.</p>
      <p>(6) Evaluation Technique</p>
      <p>The experiments were programmed in Python, using the
scikit-feature (https://github.com/jundongl/scikit-feature) repository and the
scikit-learn (https://scikit-learn.org/stable) and time (https://docs.python.org/3)
libraries, which implement the studied FS and learning
algorithms and allow us to measure execution times.</p>
      <p>(7) Workload</p>
      <p>The datasets used in the evaluation of the system
constitute its workload. Considering four dataset characteristics
with two clusters each (see Section 4.2), we systematically
selected 32 real-world datasets from the OpenML
repository for this work. We considered only datasets with
no missing values as candidates, to save time in the
preprocessing stage and to avoid the potential effect of missing
values on the classifiers' performance (especially in cases
where instances with missing values are dropped). Lastly, only datasets with
numeric values were considered, due to the input
constraint of the FS methods studied.</p>
      <p>(8) Experiment Design</p>
      <p>Out of the eight FS methods studied, the Gini, RFS, ReliefF,
SPEC, and CFS methods return a ranking of the features
in order of relevance, while JMI, MRMR, and CMIM
return a subset of the features according to the specified
size. Five possible subset sizes were studied, specifically a
feature subset size of $i \cdot n$ (rounded) for $i \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$.
This means that a total of 20 runs (Table 1) were required for the
eight FS methods and two classifiers (baseline and target)
for each of the four classification algorithms. Therefore,
with all 32 datasets, a full factorial experimental design
with 5,120 (32*20*8) experiments was used.</p>
      <sec id="sec-7-1">
        <title>7https://github.com/jundongl/scikit-feature 8https://scikit-learn.org/stable 9https://docs.python.org/3</title>
        <p>with all 32 datasets, a full factorial experimental design
with 5,120 (32*20*8) experiments will be used.
(9) Data Analysis and Interpretation</p>
        <p>To derive insights from the results of the experiments,
the Z-test, Friedman, and Nemenyi statistical methods were
used to analyze our results (i.e., the measured metrics). For
each of these tests, a null and an alternative hypothesis were
formulated, and based on the test result, we either reject
or fail to reject the null hypothesis.</p>
        <p>(10) Results</p>
        <p>Lastly, conclusions from the analysis of the results of the
experiments are presented and discussed in Section 5.</p>
        <p>Several data repositories provide open data that enables
research; OpenML was used in this work. Considering the input
constraints of some of the algorithms studied, not all datasets in
OpenML (over 21,000) could be used.</p>
        <p>The available datasets were pruned based on the following
eligibility criteria:
• Number of features ≤ 1,000
• Number of instances ≤ 10,000
• No missing values
• Numerical types (excluding timestamps)
• Single target variable
After pruning, there were 380 candidate datasets available and
the metadata (number of features, instances, classes, and the
class balance) of these datasets showed a wide range of values
for the factors to be studied; for example, the number of features
of the datasets ranged from 2 to 971. Also, we noticed repeated or
highly similar datasets. Therefore, we selected a subset that properly
represents the available datasets while reducing redundancy.</p>
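        <p>For reference, this pruning could be scripted with the openml Python package (our sketch; the column names follow OpenML's dataset-quality metadata and should be treated as assumptions to verify against the current API):</p>
        <preformat>
import openml

# List all OpenML datasets together with their metadata qualities.
df = openml.datasets.list_datasets(output_format="dataframe")

# Apply the eligibility criteria from the text, one filter at a time.
candidates = df[df["NumberOfFeatures"].le(1000)]
candidates = candidates[candidates["NumberOfInstances"].le(10000)]
candidates = candidates[candidates["NumberOfMissingValues"].eq(0)]
# numeric features only (the single target variable aside)
candidates = candidates[candidates["NumberOfSymbolicFeatures"].le(1)]
print(len(candidates))
        </preformat>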
        <p>Using the clustering discretization method, which takes into
account the intrinsic distribution of data values, each of the
factors in the metadata was discretized into two bins. This was
done both for the metadata values and for the logarithm of the values,
as shown in Tables 2 and 3 respectively. It is indeed possible to
discretize into more than two bins; however, this leads to an
exponential increase in the full factorial of the design space. With
two bins and four dataset factors, we had $2^4 = 16$ possible factor
combinations, all of which need representative datasets. However, with
three bins, only nine out of the $3^4 = 81$ combinations had representative
datasets. Increasing the number of bins would therefore result in
more factor combinations with no representative datasets.</p>
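        <p>A plausible implementation of this clustering discretization (our sketch; the paper does not name a library) is scikit-learn's KBinsDiscretizer with the 'kmeans' strategy applied to the log-scaled metadata:</p>
        <preformat>
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# meta: one row per candidate dataset with the number of features,
# number of instances, and class balance (toy values for illustration).
meta = np.array([[971.0, 5000, 0.9],
                 [2.0, 150, 1.0],
                 [60.0, 9000, 0.3]])

# Two bins per factor; bin edges found by 1-D k-means on the log values.
disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
bins = disc.fit_transform(np.log(meta))
# Together with the binary/multiclass split, each dataset then maps to
# one of the 2^4 = 16 factor combinations.
print(bins)
        </preformat>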
        <p>The discretization results from applying clustering to the
log values were used to guide the selection of the datasets from
OpenML, because they best describe the distribution of the datasets
in the repository and contained candidate datasets for all 16
combinations of the studied factors.</p>
        <p>After the discretization, two datasets were selected randomly
for each of the 16 factor combinations, for a total of 32 datasets.</p>
        <p>[Tables 2 and 3: Two discretization bins per factor (number of instances, number of features, class balance), computed on the raw and log-scaled metadata values respectively.]</p>
        <p>
To begin, all features of the dataset were discretized using the
uniform discretization method, setting the number of bins to
$\max\{\min\{\frac{m}{3}, 10\}, 2\}$, where $m$ is the number of instances in the
dataset [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Discretizing the features is done to meet the input
constraint of most FS methods which work with only discrete
features.
        </p>
        <p>Next, the dataset was split into ten partitions; nine partitions
formed the training set, while one partition was used as the test
set. After this split, the entire training set was used for FS as
well as for building the baseline classifier, while the training data
containing only the selected subset of features was used to build
the target classifier. Both classifiers were tested on the test set
and their accuracy was measured. The FS, model building, and
testing were repeated ten times for each dataset (i.e., 10-fold
cross-validation); each time, a new partition is used as the test data
while the remaining nine are used as the training data. Figure 2
gives an overview of the described execution, from discretization
to measure recording. At the end of each 10-fold
execution, the median runtime and average balanced accuracy were
recorded. The median was used for the runtime for robustness
against possible variations due to context switching, and, since
we use a different training set for each fold, there might be slight
differences in the accuracy, hence the average serves as a global
accuracy.</p>
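        <p>The per-dataset procedure can be sketched as follows (our simplified sketch; select_features is a generic stand-in for any of the eight FS methods):</p>
        <preformat>
import time
import numpy as np
from sklearn.base import clone
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

def run_experiment(X, y, clf, select_features, k):
    """10-fold CV comparing a baseline classifier (all features) with a
    target classifier built on k features chosen by select_features."""
    base_t, tgt_t, base_a, tgt_a = [], [], [], []
    for train, test in StratifiedKFold(n_splits=10).split(X, y):
        cols = select_features(X[train], y[train], k)  # FS on training data only
        for X_tr, X_te, times, accs in (
            (X[train], X[test], base_t, base_a),
            (X[train][:, cols], X[test][:, cols], tgt_t, tgt_a),
        ):
            model = clone(clf)
            start = time.perf_counter()
            model.fit(X_tr, y[train])
            times.append(time.perf_counter() - start)
            accs.append(balanced_accuracy_score(y[test], model.predict(X_te)))
    # Median runtime (robust to context switches), mean balanced accuracy.
    return (np.median(base_t), np.median(tgt_t),
            np.mean(base_a), np.mean(tgt_a))
        </preformat>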
        <p>Due to the poor stability of some FS methods, different
features might be selected when the data samples change. Hence,
FS and model building were executed together in each iteration
to ensure that the performance evaluation is done on the same
data. The Python source code and dataset information for the
experiments are publicly available on GitHub
(https://github.com/F-U-Njoku/filter-fs-impact-on-classification).</p>
    </sec>
    <sec id="sec-8">
      <title>RESULTS</title>
      <p>There is an expectation for improved model runtime and accuracy
after FS. However, FS itself could be time expensive leading to
an overhead that causes the total time needed for FS and model
building to increase beyond the time it takes to build the model
without FS.</p>
      <p>[Figure 2: Overview of the execution pipeline, from discretization through the 10-fold cross-validation split, feature selection (Gini, ReliefF, SPEC, CMIM, MRMR, JMI, RFS, CFS), and classifier training (KNN, LDA, MLP, NB), to the recording of training time and balanced accuracy.]</p>
      <p>In this section, we present and discuss the results
of our experiments on the eight FS methods and four classification
algorithms, using 32 datasets with the feature subset size set to $0.5n$,
$0.6n$, $0.7n$, $0.8n$, and $0.9n$, where $n$ is the number of features.</p>
      <p>The runtime of an FS method is affected by various factors and in
differing proportions. Some methods are influenced more by the number
of features, others by the number of instances, and so on. The
time complexities of the studied FS methods are presented in Table 4,
which shows the factors that affect the runtime of each method.</p>
      <p>Gini has the lowest runtime and is influenced mainly by the
feature and class size. ReliefF and SPEC are controlled by the
instance and feature size, with the former having a quadratic
impact on the runtime. CMIM, MRMR, and JMI all have the
same time complexity, with the feature and subset size being the
factors that predominantly influence their runtime. Lastly, RFS
and CFS have the longest runtimes, due to the numerous matrix
multiplication operations they require.</p>
      <p>[Figure 3: FS runtime (in seconds) of the studied methods.]</p>
      <p>CFS was the most time-consuming method across
all datasets; this is due to the multiple computations of pairwise
correlations between features required by this method. Second
to CFS was RFS, which took an average of 14 minutes. The other
methods took less than 50 seconds on average, while Gini was
the most efficient, taking less than a second on average.</p>
      <p>Although FS is time-expensive for some methods, the
advantage of filter methods is their generalizability: the selected
features can be used repeatedly across various learning algorithms
for model selection, since they were obtained independently of
any learning algorithm.</p>
      <p>The dataset size, particularly the feature and instance sizes,
controls the time it takes to build a classifier, as discussed in Section 3.
For learning algorithms that build classifiers during training, such
as NB, LDA, and MLP, FS reduces the model training time
as expected. For LDA, averaged over all datasets
(i.e., both binary and multiclass classification), we recorded over 50% training
time reduction when the feature subset size was set to $0.5n$,
and over 13% for $0.9n$, as shown in Figure 4a. In this case,
as expected, selecting fewer features led to faster model building.</p>
      <p>On the other hand, a lazy learning algorithm like KNN does
not build a classifier at the point of training; rather, it stores useful
metadata that is used at the point of classification. Hence, KNN
does not benefit from FS in the expected manner. As shown in
Figure 4b, KNN follows a pattern different from the other algorithms
and gives the best training time improvement with the subset size set to $0.8n$
in most cases. Therefore, the benefit of reduced model runtime
after FS depends on the learning algorithm.</p>
      <p>Accuracy is one of the metrics used to measure the performance
of a model; due to the presence of unbalanced datasets in the
workload, we specifically use the balanced accuracy.</p>
      <p>Since one of the objectives of FS is to build models with
improved performance that generalize the data better, the
expectation is an improvement in model accuracy after FS.
However, as shown in Figure 6, the change in accuracy after FS
differs for binary and multiclass classification. For binary
classification, there was an average improvement in accuracy after
FS for all methods except SPEC. In contrast, reducing the
number of features generally led to an accuracy decrease for multiclass
classification. Although SPEC is applicable to both supervised
and unsupervised learning tasks, it gave the worst performance
for both binary and multiclass classification, as shown in Figures
5 and 6; this is because it performs FS independently of the target
variable, hence the features selected by SPEC tend to be less
relevant compared to those selected by the other methods.</p>
      <p>As also shown in Figure 5, CFS yielded good results in most
cases; this is because it uses the correlation of features with the
target feature as well as the pairwise correlation of features for
FS. This is useful because it ensures minimal redundancy in the
selected features. For multiclass classification, CFS and MRMR
gave the best results (i.e., the least accuracy degradation).
Therefore, although FS generally results in a decrease in multiclass
classification accuracy, in case the dataset is too large
and FS is indeed needed, the CFS and MRMR methods are suggested.</p>
      <p>Since the improvement in accuracy is different for binary and
multiclass classification, we analyze the change in accuracy for
both classification types separately, with the aim of identifying
the factors that most influence the accuracy change.</p>
      <p>5.3.1 Statistical Analysis.</p>
      <p>Through visualization, we observe a difference in the accuracy
change obtained for binary and multiclass classification (Figure
5), and this is irrespective of the FS method used (Figure 6). We
further our investigation by statistically testing the significance of this
observation, for the subset size set to $0.7n$ and an alpha
level of 0.05.</p>
      <p>We begin by formulating the following hypotheses:
• $H_0$: The classification type does not influence the change
in accuracy derived from applying FS.
• $H_1$: The classification type influences the change in
accuracy derived from applying FS.</p>
      <p>To test the aforementioned hypotheses, we consider two
independent groups, the binary and multiclass groups, each made up
of the accuracy changes between baseline and target classifiers after
FS on 16 datasets (512 result points). Testing the previous
hypotheses using the Z-test, the results show that the impact of FS on
binary classification in terms of accuracy change ($\mu = 0.0058$, $\sigma =
0.0506$, $n = 508$) was significantly higher than on multiclass
classification ($\mu = -0.0627$, $\sigma = 0.1483$, $n = 512$), $z = 9.2589$,
$p &lt; 0.001$. Therefore, the classification type indeed influences
the change in accuracy derived from applying FS.</p>
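      <p>A two-sample Z-test over the reported group statistics can be reproduced as follows (our sketch; the exact z value may differ slightly depending on the test variant used):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

# Reported summary statistics (mean, standard deviation, sample size).
mu_b, sd_b, n_b = 0.0058, 0.0506, 508    # binary classification
mu_m, sd_m, n_m = -0.0627, 0.1483, 512   # multiclass classification

z = (mu_b - mu_m) / np.sqrt(sd_b**2 / n_b + sd_m**2 / n_m)
p = 2 * norm.sf(abs(z))                  # two-sided p-value
print(z, p)
      </preformat>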
      <p>
Besides the type of classification, the various classification
algorithms could interact with the FS methods to give different
results in terms of accuracy change. Hence, using Friedman's
test, we evaluate the accuracy changes for binary and multiclass
classification separately. The Friedman ($F_F$) test [
        <xref ref-type="bibr" rid="ref8">8</xref>
] ranks the accuracy
change of the four classifiers for each dataset-FS method pair, with
the best rank being 1 and the worst rank 4. In the case of
ties, the average rank is used. We test the null hypothesis that all
algorithms benefit equally from FS; therefore, there should be an
equal ranking of the derived accuracy change for each dataset
and FS method pair.
      </p>
      <p>For binary classification, using 16 datasets (for the CFS
method, the largest dataset was left out, as it did not complete
within reasonable time), eight FS methods,
and four classification algorithms, the $F_F$-statistic is distributed
according to the $F$ distribution with $(4 - 1) = 3$ and $(4 - 1)(127 -
1) = 378$ degrees of freedom. The critical value of $F(3, 378)$ is
2.6285, and the derived $F$-statistic is 2.4741, giving $p = 0.0612$.
Therefore, we fail to reject the null hypothesis and conclude that
the studied classification algorithms gain similarly from FS for
binary classification.</p>
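      <p>The reported F-statistics are consistent with the Friedman test followed by the Iman-Davenport correction described in [8]; a sketch of this computation (ours, with random toy data) using SciPy:</p>
      <preformat>
import numpy as np
from scipy.stats import friedmanchisquare, f

# data: accuracy changes, one row per (dataset, FS method) pair and one
# column per classifier (NB, KNN, LDA, MLP); random toy values here.
rng = np.random.default_rng(0)
data = rng.normal(size=(127, 4))

chi2, _ = friedmanchisquare(*data.T)        # Friedman chi-square statistic
N, k = data.shape
F = (N - 1) * chi2 / (N * (k - 1) - chi2)   # Iman-Davenport F correction
p = f.sf(F, k - 1, (k - 1) * (N - 1))       # compare against F(3, 378)
print(F, p)
      </preformat>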
      <p>For multiclass classification, using 16 datasets, eight FS
methods, and four classification algorithms, the $F_F$-statistic is distributed
according to the $F$ distribution with $(4 - 1) = 3$ and $(4 - 1)(128 - 1) =
381$ degrees of freedom. The critical value of $F(3, 381)$ is 2.6283, and
the derived $F$-statistic is 11.4198, giving $p &lt; 0.001$. Therefore,
we reject the null hypothesis and proceed to a post-hoc test.
We use the Nemenyi test with critical value 2.569 to compare the
classifiers to each other, which yields a Critical Difference (CD)
of 0.4145. Using the CD, we identify two groups of algorithms,
as shown in Figure 7: LDA and MLP benefit more in terms of
accuracy improvement after FS than KNN and NB.
Therefore, for multiclass classification, the impact of FS on accuracy is
affected by the choice of algorithm.</p>
      <p>From the results of our experiments, presented previously and
in the supplementary material
(https://github.com/F-U-Njoku/filter-fs-impact-on-classification), it is clear that there is no FS
method that is best for all cases, i.e., for all datasets, tasks, and
learning algorithms. Choosing a feature selection method is, therefore,
use-case dependent. The first step is to define the analysis task, for
which we considered binary and multiclass classification. Feature
selection behaves differently for these tasks, particularly in terms
of how it impacts model performance and execution time.</p>
      <p>For multiclass classification, feature selection does not
improve the model performance (i.e., accuracy); on the contrary, it
degrades it, and the smaller the feature subset size, the worse the
decline in performance. For this reason, our analysis suggests
that feature selection should be avoided for multiclass
classification. Yet, in case it is necessary (e.g., for performance or training time
reasons), then either CFS (for small-sized datasets) or MRMR (for
large-sized datasets) can be used, because they gave the least
accuracy degradation, and improvement in a few cases, compared
to the other studied methods.</p>
      <p>In the case of binary classification, we observed improvements
in the classification accuracy, which depend on the chosen
classification algorithm. For all five feature subset sizes that we set, CFS
and MRMR increased the model accuracy for the NB classifier.
For the KNN classifier, CFS, ReliefF, Gini, and CMIM improved
accuracy, in that order. For the LDA classifier, CFS, ReliefF, Gini,
and CMIM improved accuracy, in that order, and lastly, for the MLP
classifier, CFS led to an improvement in accuracy. It is tempting
to assume that CFS is the best choice to make when the goal
is to improve performance for binary classification. However,
we must recall from Figure 3 that this method is
computationally expensive and is unsuitable for large datasets. Hence, these
recommendations must be considered taking into account the
dataset size and the FS methods' complexity, as presented in Section 5.1.</p>
      <p>Finally, whether for binary or multiclass classification, Gini is
a time-efficient method, and especially for very large datasets,
we propose this feature selection method to improve runtime.
Figure 8 summarizes our recommendations based on the results
of our experiments, given the goal, task, and classification
algorithm.</p>
      <p>[Figure 8: Recommended FS methods per goal (runtime vs. accuracy), task (binary vs. multiclass), and classification algorithm.]</p>
    </sec>
    <sec id="sec-9">
      <title>6 CONCLUSIONS AND FUTURE WORK</title>
      <p>In this work, we studied the impact of eight filter FS methods on
four classification algorithms using 32 real-world datasets. We
found that, on average, FS improves the accuracy of binary
classification; however, for multiclass classification, feature selection
leads to a decline in model accuracy.</p>
      <p>With regard to FS runtime, the Gini FS method was the most
efficient, with an average runtime of less than a second, making
it the recommended FS method when minimizing runtime is
the objective. However, when the goal is to improve classifier
accuracy, none of the studied FS methods gives the best accuracy
improvement in all cases. Although FS for multiclass classification
on average led to a degradation in classifier accuracy, we
observed that the CFS and MRMR methods gave the best results, in
that they yielded the least degradation in accuracy and led to
improvement in some cases.</p>
      <p>As further work, we plan to extend this work to larger datasets,
with more clusters of dataset factors, using multiple repositories
to facilitate finding representative datasets. We also intend to
investigate the dependence of the multiclass classification performance
degradation after FS on the multiclass classification strategy, and
the KNN training time degradation with the reduction in the
number of features observed in some cases.</p>
      <p>Since FS applies to supervised and unsupervised tasks, this
work can be extended to regression and clustering tasks. Lastly,
several existing methods have polynomial runtimes and do not
scale efficiently; more efficient implementations of these methods,
aiming for linear or sub-linear time complexity, are highly needed,
especially with Big Data being ubiquitous.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>The project leading to this publication has received funding from
the European Commission under the European Union’s Horizon
2020 research and innovation programme (grant agreement No
955895).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hajar</given-names>
            <surname>Alla</surname>
          </string-name>
          , Lahcen Moumoun, and
          <string-name>
            <given-names>Youssef</given-names>
            <surname>Balouki</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A Multilayer Perceptron Neural Network with Selective-Data Training for Flight Arrival Delay Prediction</article-title>
          .
          <source>Scientific Programming</source>
          <year>2021</year>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yindalon</given-names>
            <surname>Aphinyanaphongs</surname>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence D Fu</surname>
            ,
            <given-names>Zhiguo</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            , Eric R Peskin, Efstratios Efstathiadis, Constantin F Aliferis,
            <given-names>and Alexander</given-names>
          </string-name>
          <string-name>
            <surname>Statnikov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>65</volume>
          ,
          <issue>10</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Verónica</given-names>
            <surname>Bolón-Canedo</surname>
          </string-name>
          ,
          <source>Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos</source>
          .
          <year>2013</year>
          .
          <article-title>A review of feature selection methods on synthetic data</article-title>
          .
          <source>Knowledge and information systems 34</source>
          , 3 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Bommert</surname>
          </string-name>
          , Xudong Sun, Bernd Bischl, Jörg Rahnenführer, and
          <string-name>
            <given-names>Michel</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Benchmark for filter methods for feature selection in high-dimensional classification data</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          <volume>143</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Gianluca</given-names>
            <surname>Bontempi</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Statistical foundations of machine learning (2nd edition) handbook</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Brownlee</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Master Machine Learning Algorithms: discover how they work and implement them from scratch</article-title>
          .
          <source>Machine Learning Mastery.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Deng</given-names>
            <surname>Cai</surname>
          </string-name>
          , Xiaofei He, and Jiawei Han.
          <year>2008</year>
          .
          <article-title>Training linear discriminant analysis in linear time</article-title>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Janez</given-names>
            <surname>Demšar</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Zhenyun</given-names>
            <surname>Deng</surname>
          </string-name>
          , Xiaoshu Zhu, Debo Cheng, Ming Zong, and
          <string-name>
            <given-names>Shichao</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Efficient kNN classification algorithm for big data</article-title>
          .
          <source>Neurocomputing</source>
          <volume>195</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Drotár</surname>
          </string-name>
          , Juraj Gazda, and
          <string-name>
            <given-names>Zdenek</given-names>
            <surname>Smékal</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>An experimental comparison of feature selection methods on two-class biomedical datasets</article-title>
          .
          <source>Computers in biology and medicine</source>
          <volume>66</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Charles</given-names>
            <surname>Elkan</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Naive Bayesian Learning</article-title>
          .
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>François</given-names>
            <surname>Fleuret</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Fast binary feature selection with conditional mutual information</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          ,
          <issue>9</issue>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Mark Andrew</given-names>
            <surname>Hall</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Correlation-based feature selection for machine learning</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . The University of Waikato (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Raj</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Igor</given-names>
            <surname>Kononenko</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Estimating attributes: Analysis and extensions of RELIEF</article-title>
          .
          <source>In European conference on machine learning</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Andrea S</given-names>
            <surname>Laliberte</surname>
          </string-name>
          , DM Browning, and
          <string-name>
            <given-names>Albert</given-names>
            <surname>Rango</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A comparison of three feature selection methods for object-based classification of sub-decimeter resolution UltraCam-L imagery</article-title>
          .
          <source>International Journal of Applied Earth Observation and Geoinformation</source>
          <volume>15</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jundong</given-names>
            <surname>Li</surname>
          </string-name>
          , Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu.
          <year>2017</year>
          .
          <article-title>Feature selection: A data perspective</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>50</volume>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Hsi-Che</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pei-Chen</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tzung-Chien</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ting-Chi</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chien-Yu</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jen-Yin</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lee-Yung</given-names>
            <surname>Shih</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Der-Cherng</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Comparison of Feature Selection Methods for Cross-Laboratory Microarray Analysis</article-title>
          .
          <source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
          <volume>10</volume>
          ,
          <issue>3</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Munirah</given-names>
            <surname>Mohd Yusof</surname>
          </string-name>
          , Rozlini Mohamed, and
          <string-name>
            <given-names>Noorhaniza</given-names>
            <surname>Wahid</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Benchmark of feature selection techniques with machine learning algorithms for cancer datasets</article-title>
          .
          <source>In Proceedings of the ICAIR-CACRE.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Mucherino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Petraq J.</given-names>
            <surname>Papajorgji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Panos M.</given-names>
            <surname>Pardalos</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>k-Nearest Neighbor Classification</article-title>
          . Springer New York.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Hai Thanh</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Slobodan Petrović, and
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Franke</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A comparison of feature-selection methods for intrusion detection</article-title>
          .
          <source>In International conference on mathematical methods, models, and architectures for computer network security</source>
          . Springer,
          <fpage>242</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Dijana</given-names>
            <surname>Oreski</surname>
          </string-name>
          , Stjepan Oreski, and
          <string-name>
            <given-names>Bozidar</given-names>
            <surname>Klicek</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Effects of dataset characteristics on the performance of feature selection techniques</article-title>
          .
          <source>Applied Soft Computing</source>
          <volume>52</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Hanchuan</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fuhui</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Ding</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>27</volume>
          ,
          <issue>8</issue>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Marko</given-names>
            <surname>Robnik-Šikonja</surname>
          </string-name>
          and Igor Kononenko.
          <year>2003</year>
          .
          <article-title>Theoretical and empirical analysis of ReliefF and RReliefF</article-title>
          .
          <source>Machine learning</source>
          <volume>53</volume>
          ,
          <issue>1</issue>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Ryan J.</given-names>
            <surname>Urbanowicz</surname>
          </string-name>
          , Melissa Meeker, William La Cava, Randal S. Olson, and
          <string-name>
            <given-names>Jason H.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Relief-based feature selection: Introduction and review</article-title>
          .
          <source>Journal of biomedical informatics</source>
          <volume>85</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Christiaan Maarten</given-names>
            <surname>Van der Walt</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Data measures that characterise classification problems</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . University of Pretoria.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan N.</given-names>
            <surname>van Rijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bernd</given-names>
            <surname>Bischl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luis</given-names>
            <surname>Torgo</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>OpenML: Networked Science in Machine Learning</article-title>
          .
          <source>SIGKDD Explorations</source>
          <volume>15</volume>
          ,
          <issue>2</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Michel</given-names>
            <surname>Vidal-Naquet</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shimon</given-names>
            <surname>Ullman</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Object Recognition with Informative Features and Linear Classification</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          , Vol.
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Philip D.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tom</given-names>
            <surname>Schwartz</surname>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>Neural networks. II. What are they and why is everybody so interested in them now?</article-title>
          <source>IEEE expert</source>
          <volume>3</volume>
          ,
          <issue>1</issue>
          (
          <year>1988</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Hilde JP</given-names>
            <surname>Weerts</surname>
          </string-name>
          , Andreas C Mueller,
          and
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Importance of tuning hyperparameters of machine learning algorithms</article-title>
          . arXiv preprint arXiv:2007.07588 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Zhou</given-names>
            <surname>Xu</surname>
          </string-name>
          , Jin Liu, Zijiang Yang,
          <string-name>
            <given-names>Gege</given-names>
            <surname>An</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiangyang</given-names>
            <surname>Jia</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Impact of Feature Selection on Defect Prediction Performance: An Empirical Comparison</article-title>
          .
          <source>In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE).</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Huan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Spectral feature selection for supervised and unsupervised learning</article-title>
          .
          <source>In Proceedings of the 24th international conference on Machine learning.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>