Effect of Metalearning on Feature Selection Employment

Silvia Nunes das Dôres¹, Carlos Soares², Duncan Ruiz¹

¹ Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brasil
² INESC TEC/Faculdade de Engenharia, Universidade do Porto, Porto, Portugal

Abstract. Feature selection is important to improve learning performance, reduce computational complexity and decrease required storage. There are multiple methods for feature selection, with varying impact and computational cost; therefore, choosing the right method for a given data set is important. In this paper, we analyze the advantages of metalearning for feature selection employment. This issue is relevant because a wrong decision may imply additional processing, when FS is unnecessarily applied, or a loss of performance, when it is not used in a problem for which it is appropriate. Our results show that, although there is an advantage in using metalearning, the gains are not yet substantial, which opens the way for new research in the area.

Keywords: feature selection, metalearning, metalearning evaluation

1 Introduction

Feature selection (FS) is a widely employed technique for reducing dimensionality [9]. It aims to select a subset of variables from the input data that can efficiently describe these data while reducing the effects of noise and irrelevant variables, still providing good prediction results, lower computational cost and better model interpretability [5].

Recent works have approached the problem of FS method selection from the perspective of metalearning [1], showing interesting first results and paving the way for novel applications to be fully explored [11], [8], [4], [7], [6]. These works are based on the well-known no-free-lunch theorem [12], which states that no single FS algorithm is the best for all problems. To spare the data miner from experimenting with different methods in a trial-and-error approach, which can be time-consuming and costly, especially with very large data sets, metalearning is used to automatically recommend the FS algorithm best suited for a given problem.

However, existing approaches to metalearning for FS method selection focus their evaluation at the meta-level, i.e., whether it is possible to guess which is the best FS method for a given data set. They ignore the magnitude of the gain/loss that the method yields in terms of base-level performance (e.g., the difference in classification performance of applying a learning algorithm with and without the selected FS method).

Thus, the purpose of this study is to extend the evaluation of metalearning approaches to FS method selection by taking the base-level performance into account. We address two research questions. First, we investigate whether metalearning is useful for indicating when to use FS or not. The results show the advantages of using metalearning when compared to baselines such as random choice and the choice based on the majority class. For the second question, we analyze the gains in base-level performance associated with the use of metalearning. In this experiment, the results obtained by the metalearning strategy are very close to those obtained by an ideal meta-model (oracle), although they are also close to the recommendation based on the majority class.

This paper is organized as follows: Section 2 discusses related work. Section 3 presents our experimental analysis and the results obtained. Section 4 presents our conclusions and future work.
2 Related Work

In the context of FS, the use of metalearning is relatively new, and only a few studies offer significant advances towards solving this problem [11], [8], [4], [7], [6].

Post et al. [7] presented a large scale experiment on the benefits of using FS for binary classification problems. They used 394 data sets, 12 classification algorithms and 1 FS method in two experiments: the first investigates for which learning algorithms FS improves predictive performance, and the second applies metalearning to recommend when to use FS for each learning algorithm. The first results pointed out the usefulness of FS for the different learning algorithms. In the second experiment, the focus was the evaluation of different sets of meta-features on the behavior of the metalearning strategy. The results show benefits from the use of metalearning, but the proposed strategy is not compared with other recommendation strategies, neither at the meta-level nor at the base-level, where the obtained performance gain would be evaluated.

3 Metalearning for the Recommendation of Feature Selection Methods

In this paper, we intend to improve the quality of FS employment in classification problems, extending the experiments carried out in [7] in order to evaluate the advantages of applying metalearning to decide when to employ FS, both at the meta-level and at the base-level. We address two research questions:

1. Is metalearning useful for indicating when to use FS or not?
2. What are the gains in base-level performance associated with the use of metalearning?

The experimental setup is detailed below.

3.1 Experimental Setup

As discussed earlier, we extend the experiments carried out in [7], since it is the most comprehensive study on the use of metalearning for FS method recommendation, and also because it is the only one that uses metalearning to indicate when to use FS. For this purpose, we used the base-level experimental results provided by the authors on the OpenML platform to evaluate the performance gain of the proposed metalearning strategy. Further details about the data sets, algorithms, and meta-features can be obtained from OpenML (https://www.openml.org/s/15).

Briefly, 394 data sets of binary classification problems were used, containing between 10 and 200,000 examples. One FS method (CfsSubsetEval, a correlation-based method) and 12 machine learning algorithms (AdaBoost, HoeffdingTree, K-Nearest Neighbor, Logistic, Multilayer Perceptron, J48, JRip, Naive Bayes, Random Forest, REPTree, Stochastic Gradient Descent, Support Vector Machine) were applied to these data sets, using 10-fold cross-validation. For each data set/learning algorithm pair, performance was evaluated with and without FS using the AUC measure, given the imbalanced class distributions in the data sets. At the meta-level, 40 meta-features were extracted from each data set (simple, statistical, information-theoretic, landmarkers and FS landmarkers). Each instance in the meta-database corresponds to two cross-validation runs of an algorithm on a data set, one with FS and one without FS, and the meta-target is a label indicating whether the run with FS achieved better performance. The Random Forest algorithm (with 100 trees) was used as the meta-learner.
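To make this setup concrete, the sketch below assembles such a meta-learner with scikit-learn and evaluates it with the leave-one-out procedure described next. It is a minimal illustration under stated assumptions, not the code of the original study: X_meta and y_meta are random stand-ins for the meta-features and meta-targets extracted from OpenML.

```python
# Minimal sketch of the metalearning setup (assumes scikit-learn).
# X_meta and y_meta are random placeholders for the real meta-database.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)

# Stand-in meta-database: 394 data sets x 40 meta-features (simple,
# statistical, information-theoretic, landmarkers, FS landmarkers).
X_meta = rng.normal(size=(394, 40))

# Meta-target: 1 if the run with FS had a better AUC than the run without.
y_meta = (X_meta[:, 0] + rng.normal(scale=0.5, size=394) > 0).astype(int)

# Random Forest meta-learner with 100 trees, evaluated with the
# leave-one-out cross-validation procedure described below.
meta_learner = RandomForestClassifier(n_estimators=100, random_state=0)
meta_pred = cross_val_predict(meta_learner, X_meta, y_meta, cv=LeaveOneOut())

# Meta-level performance of the recommendations, as in Table 1.
print("meta-level AUC:", roc_auc_score(y_meta, meta_pred))
```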
In this work, we employ a typical leave-one-out cross-validation (LOO-CV) procedure to evaluate the quality of the recommendations made by the metalearning strategy, both at the meta-level and at the base-level.

3.2 Meta-level Evaluation

For the first question, we carried out a set of experiments with the goal of providing a proof of concept of metalearning applied to FS recommendation. By testing whether the use of metalearning improves the quality of the decision between using FS or not, we aim to show that this strategy can avoid an ad hoc choice. We compare it with four distinct baselines: i) alwaysFS, a strategy that recommends the application of FS for all problems; ii) neverFS, a strategy that never recommends the use of FS; iii) random, a strategy that randomly selects whether to use FS or not; and iv) default, a strategy that selects, for each algorithm, the most frequent choice between using FS or not (the majority class) across the data sets considered.

Table 1 shows the results of the experiment in terms of performance measured by AUC.

Table 1. Performance at the Meta-Level (AUC)

Algorithm      metalearning  alwaysFS  neverFS  random  default
AdaBoost          0.686        0.5       0.5     0.495    0.5
HoeffdingTree     0.661        0.5       0.5     0.504    0.5
K-NN              0.766        0.5       0.5     0.506    0.5
J48               0.761        0.5       0.5     0.522    0.5
JRip              0.656        0.5       0.5     0.507    0.5
Logistic          0.719        0.5       0.5     0.498    0.5
MLP               0.716        0.5       0.5     0.537    0.5
NaiveBayes        0.728        0.5       0.5     0.492    0.5
REPTree           0.688        0.5       0.5     0.507    0.5
RandomForest      0.612        0.5       0.5     0.517    0.5
SGD               0.710        0.5       0.5     0.503    0.5
SVM               0.722        0.5       0.5     0.526    0.5
average           0.702        0.5       0.5     0.509    0.5

These results show that metalearning achieves superior performance for all the learning algorithms. In ROC analysis, an ideal result is one that maximizes the true positive rate while minimizing the false positive rate. As expected, the alwaysFS, neverFS and default strategies obtained an average performance (AUC of 0.5), since they cannot minimize the false positive rate. The random strategy balances the true positive and false positive rates, which also leads to an average performance. Thus, metalearning stands out as the best approach for deciding whether to use FS.

However, statistical validation shows that, although there is a difference between metalearning and the random strategy, this difference is not statistically significant. In contrast, the difference between metalearning and the alwaysFS, neverFS and default strategies is statistically significant. These tests were carried out using the methodology proposed by Demšar [3], a Friedman rank test followed by the Nemenyi post-hoc test for multiple comparisons (a sketch of this procedure is given after Figure 1), and the results are presented in Figure 1. Recommendation strategies are sorted by their average ranking (lower is better), and those connected by a horizontal line are statistically equivalent.

Fig. 1. Critical Difference diagram (with α = 0.05) of the experiments at the meta-level.
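The following toy sketch illustrates the statistical procedure of [3], assuming SciPy is available: a Friedman rank test over the strategies' scores, followed by the Nemenyi critical difference. The scores are random placeholders, not the values of Table 1, and the q_0.05 critical value for five strategies is taken from Demšar's tables.

```python
# Toy sketch of the validation methodology of Demsar (2006): Friedman
# rank test, then the Nemenyi critical difference. The score matrix is
# a random placeholder for the per-algorithm AUCs of Table 1.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(1)
strategies = ["metalearning", "alwaysFS", "neverFS", "random", "default"]
N, k = 12, len(strategies)                  # 12 algorithms x 5 strategies
scores = rng.uniform(0.4, 0.8, size=(N, k))

# Friedman test on the k related samples (one column per strategy).
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

# Average rank of each strategy (rank 1 = highest AUC on that row).
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k*(k+1) / (6*N)),
# with q_0.05 = 2.728 for k = 5 (Demsar, 2006). Strategies whose average
# ranks differ by less than CD are statistically equivalent.
cd = 2.728 * np.sqrt(k * (k + 1) / (6.0 * N))
print(f"critical difference = {cd:.3f}")
for name, r in sorted(zip(strategies, avg_ranks), key=lambda t: t[1]):
    print(f"  {name}: average rank {r:.2f}")
```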
3.3 Base-Level Evaluation

In this experiment, our goal is to evaluate the performance gain obtained through the use of metalearning for FS employment at the base-level, i.e., directly on the performance of the learning algorithms. For this, metalearning was applied over the 12 meta-databases and, for each data set, the recommendation to use FS or not was scored according to the performance obtained by the corresponding class at the base-level, in terms of AUC.

As baselines, we use two approaches: i) default, a strategy that selects the most frequent choice between using FS or not (the majority class); and ii) oracle, an ideal meta-model that always selects the best option for each data set. Table 2 presents these results.

Table 2. Performance Gain at the Base-Level (AUC)

Algorithm      oracle  metalearning  default
AdaBoost       0.873      0.868       0.867
HoeffdingTree  0.804      0.787       0.772
K-NN           0.814      0.796       0.795
J48            0.809      0.798       0.794
JRip           0.805      0.796       0.796
Logistic       0.828      0.817       0.808
MLP            0.817      0.801       0.795
NaiveBayes     0.827      0.819       0.812
REPTree        0.802      0.783       0.776
RandomForest   0.885      0.881       0.877
SGD            0.769      0.763       0.756
SVM            0.767      0.762       0.756
average        0.817      0.806       0.800

The results show a small advantage in the average performance of metalearning relative to the default strategy. Despite being small, the advantage is consistent across all algorithms. In addition, the results show only a small difference between the performance of metalearning and the oracle, indicating that the metalearning recommendations come close to the ideal choice. The Critical Difference (CD) diagram generated in the base-level experiments is presented in Figure 2.

Fig. 2. Critical Difference diagram of the experiments at the base-level (α = 0.05).

This diagram reflects the results presented in Table 2, indicating that there is no statistical difference between metalearning and the default recommendation (majority class) in terms of performance gain. We can also note that, although the average performance of metalearning is close to that of the oracle, statistical analysis shows that this difference is significant.

However, when we take into account that the application of FS does not significantly improve the performance of the learning algorithms for most data sets [7], we can evaluate the gains obtained by metalearning on the problems in which the use of FS has the greatest impact. Table 3 presents these results. For each learning algorithm, we considered the number of data sets for which the application of FS yields a gain above 5% in the final result (Frequency) and the percentage of metalearning hits on these significant data sets (Hit Rate); a sketch of this computation is given at the end of this section.

Table 3. Effect of metalearning on significant data sets

Algorithm      Frequency  Hit Rate
AdaBoost           15      26.70%
HoeffdingTree     127      60.62%
K-NN              154      83.12%
J48                63      58.73%
JRip               42      19.04%
Logistic           99      56.56%
MLP                91      66.66%
Naive Bayes       114      70.17%
REPTree           119      53.78%
Random Forest      41      53.66%
SGD                72      55.55%
SVM                74      51.35%

Figure 3 presents details of these results for the K-NN and SGD algorithms. Since instance-based algorithms are sensitive to the curse of dimensionality, FS was expected to bring performance gains to K-NN, which was confirmed in the results. The application of FS provided a performance increase greater than 5% in 154 data sets (about 39%), and among these, metalearning achieved a hit rate of 83.12%. In the case of the SGD algorithm, its performance was expected not to be strongly affected by the number of features, so little advantage was expected from the application of FS. Nevertheless, the results show that about 18% of the data sets benefited from the application of FS, and that metalearning obtained a 55.55% hit rate on these data.

[Figure 3: paired histograms for K-Nearest Neighbor and Stochastic Gradient Descent; x-axis: Impact of FS; y-axis: Frequency; series: Data sets and Hits.]

Fig. 3. Histograms of metalearning hits compared with the frequency with which FS impacts the data sets, for the K-NN and SGD algorithms.
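As a concrete reading of Table 3, the sketch below computes the frequency of significant data sets and the corresponding hit rate for a single learning algorithm. It runs on synthetic placeholder AUCs, and it interprets the 5% threshold as an absolute AUC difference, which is our assumption rather than a detail stated in the text.

```python
# Sketch of the computation behind Table 3, on synthetic placeholder
# AUCs for one learning algorithm. The 5% threshold is interpreted here
# as an absolute AUC difference (an assumption).
import numpy as np

rng = np.random.default_rng(2)
n = 394                                         # number of data sets
auc_without_fs = rng.uniform(0.6, 0.9, size=n)  # base-level AUC, no FS
auc_with_fs = auc_without_fs + rng.normal(scale=0.08, size=n)
meta_pred = rng.integers(0, 2, size=n)          # meta-learner: 1 = "use FS"

impact = auc_with_fs - auc_without_fs           # impact of FS (cf. Figure 3)
significant = impact > 0.05                     # FS gain above 5%: the
                                                # "Frequency" column of Table 3
hit_rate = meta_pred[significant].mean()        # share of those data sets on
                                                # which FS was recommended

print(f"significant data sets: {significant.sum()} of {n}")
print(f"metalearning hit rate on them: {hit_rate:.2%}")
```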
4 Conclusions

In this work, we evaluated the use of metalearning to indicate when to employ feature selection in conjunction with different classification algorithms. The results showed that, both at the meta-level and at the base-level, metalearning outperforms the analyzed baselines, although the difference is not always statistically significant. However, when we analyze the advantages of metalearning on the problems where FS has the greatest impact on performance, we observe good results. Future work in this area includes extending the analysis to new FS methods, in order to better generalize the results over diverse inductive biases, and developing new meta-features more appropriate for this problem.

Acknowledgments. This work is financed by the ERDF European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013. The authors would also like to thank CAPES for the financial support through project PDSE/88881.135787/2016-01.

References

1. Brazdil, P., Carrier, C.G., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer-Verlag Berlin Heidelberg (2008)
2. Chu, C., Hsu, A.L., Chou, K.H., Bandettini, P., Lin, C.: Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images. NeuroImage 60, 59–70 (2012)
3. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
4. Filchenkov, A., Pendryak, A.: Datasets meta-feature description for recommending feature selection algorithm. In: Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference, pp. 11–18. IEEE, St. Petersburg (2015)
5. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
6. Parmezan, A.R.S., Lee, H.D., Wu, F.C.: Metalearning for choosing feature selection algorithms in data mining: Proposal of a new framework. Expert Systems with Applications 75, 1–24 (2017)
7. Post, M.J., van der Putten, P., van Rijn, J.N.: Does feature selection improve classification? A large scale experiment in OpenML. In: International Symposium on Intelligent Data Analysis, pp. 158–170. Springer, Stockholm (2016)
8. Shilbayeh, S., Vadera, S.: Feature selection in meta learning framework. In: Science and Information Conference, pp. 269–275. IEEE, London (2014)
9. Tang, J., Alelyani, S., Liu, H.: Feature selection for classification: A review. Data Classification: Algorithms and Applications.
37, 1–33 (2014)
10. Vilalta, R., Drissi, Y.: A perspective view and survey of meta-learning. Artificial Intelligence Review 18, 77–95 (2002)
11. Wang, G., Song, Q., Sun, H., Zhang, X., Xu, B., Zhou, Y.: A feature subset selection algorithm automatic recommendation method. Journal of Artificial Intelligence Research 47, 1–34 (2013)
12. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1, 67–82 (1997)