A Framework To Decompose And Develop Metafeatures

Fábio Pinto (LIAAD-INESC TEC, Universidade do Porto, Portugal, e-mail: fhpinto@inesctec.pt), Carlos Soares (CESE-INESC TEC, Universidade do Porto, Portugal, e-mail: csoares@fe.up.pt) and João Mendes-Moreira (LIAAD-INESC TEC, Universidade do Porto, Portugal, e-mail: jmoreira@fe.up.pt)

Abstract. This paper proposes a framework to decompose and develop metafeatures for Metalearning (MtL) problems. Several metafeatures (also known as data characteristics) have been proposed in the literature for a wide range of problems. Since MtL applicability is very general but problem dependent, researchers focus on generating specific and yet informative metafeatures for each problem. This process is carried out without any sort of conceptual framework. We believe that such a framework would open new horizons for the development of metafeatures and also aid the process of understanding the metafeatures already proposed in the state of the art. We propose a framework with the aim of filling that gap and we show its applicability in a scenario of algorithm recommendation for regression problems.

1 Introduction

Researchers have been using MtL to overcome numerous challenges faced by data mining practitioners, such as algorithm selection [3][23], time series forecasting [9], data streams [19][20][5], parameter tuning [22] or understanding of learning behavior [6].

As the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [2], MtL is used to extrapolate knowledge gained in previous experiments to better manage new problems. That knowledge is stored as metadata, particularly metafeatures and the metatarget, as outlined in Figure 1. The metafeatures (extracted from A to B and stored in F) consist of data characteristics that relate the learning algorithms and the data under analysis, e.g., the correlation between numeric attributes of a dataset. The metatarget (extracted through C-D-E and stored in F) represents the meta-variable that one wishes to understand or predict, e.g., the algorithm with the best performance for a given dataset.

Figure 1. Metalearning: knowledge acquisition. Adapted from [2]. (Diagram relating a Dataset, its Metafeatures and the resulting Metadata with the steps Choose Learning Techniques, Learning Strategy and Performance Evaluation, labelled A to F.)

Independently of the problem at hand, the main issue in MtL concerns defining the metafeatures. If the user is able to generate informative metafeatures, it is very likely that the application of MtL is going to be successful. The state of the art shows that there are three types of metafeatures: 1) simple, statistical and information-theoretic metafeatures. In this group we can find the number of examples of the dataset, the correlation between numeric attributes or the class entropy, to name a few. The application of this kind of metafeatures provides not only informative metafeatures but also interpretable knowledge about the problems [3]; 2) model-based metafeatures [13]. These capture some characteristic of a model generated by applying a learning algorithm to a dataset, e.g., the number of leaf nodes of a decision tree. Finally, a metafeature can also be 3) a landmarker [14]. These are generated by making a quick performance estimate of a learning algorithm on a particular dataset.

Although the state of the art proposes several metafeatures of all types for a wide range of problems, we argue that the literature lacks a unifying framework to categorize and develop new metafeatures. Such a framework could help MtL users by systematizing the process of generating new metafeatures. Furthermore, the framework could be very useful to compare different metafeatures and to assess whether there is overlap in the information that they capture. In this paper, we propose a framework with that purpose and we use it in the analysis of the metafeatures used in several MtL applications. We also show its applicability to generate metafeatures in a scenario of algorithm recommendation for regression problems.

The paper is organized as follows. In Section 2 we present a brief overview of MtL applications and the respective metafeatures. Section 3 details the proposed framework to decompose and develop metafeatures. In Section 4 we use the framework to decompose and understand metafeatures already proposed in the literature. Section 5 exemplifies how the framework can be used to develop new metafeatures in a scenario of algorithm recommendation for regression problems. Finally, we conclude the paper with some final remarks and future work.
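To make the knowledge-acquisition process of Figure 1 concrete, the following sketch (in Python with scikit-learn; a minimal illustration under our own assumptions, where the helper names and the choice of datasets and algorithms are ours and not those of any of the cited systems) computes two simple metafeatures for each dataset, obtains the metatarget as the algorithm with the best cross-validated accuracy, and stores both as metadata from which a metamodel could later be induced.

import numpy as np
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def metafeatures(X, y):
    # Two simple metafeatures: number of examples and class entropy.
    p = np.bincount(y) / len(y)
    class_entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {"n_examples": X.shape[0], "class_entropy": class_entropy}

def metatarget(X, y, algorithms):
    # Metatarget: the algorithm with the best cross-validated accuracy.
    scores = {name: cross_val_score(clf, X, y, cv=5).mean()
              for name, clf in algorithms.items()}
    return max(scores, key=scores.get)

algorithms = {"tree": DecisionTreeClassifier(random_state=0), "nb": GaussianNB()}
metadata = []
for loader in (load_iris, load_wine):   # stand-ins for a repository of datasets
    X, y = loader(return_X_y=True)
    metadata.append((metafeatures(X, y), metatarget(X, y, algorithms)))
print(metadata)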
2 Metalearning

MtL emerges as the most promising answer of machine learning researchers to the need for an intelligent assistant for data analysis [21]. Since the majority of data mining processes include several non-trivial decisions, it would be useful to have a system that could guide users in analyzing their data.

The main focus of MtL research has been the problem of algorithm recommendation. Several works proposed systems in which data characteristics were related with the performance of learning algorithms on different datasets. The system of Brazdil et al. [3] provides recommendations in the form of rankings of learning algorithms. Besides the MtL system, they also proposed an evaluation methodology for ranking problems that is useful for the problem of algorithm ranking. Sun and Pfahringer [23] extended the work of Brazdil et al. with two main contributions: pairwise meta-rules, generated by comparing the performance of individual base learners in a one-against-one manner, and a new meta-learner for ranking algorithms.

Another problem addressed by MtL has been the selection of the best method for time series forecasting. The first attempt was carried out by Prudêncio and Ludermir [16] with two different systems: one that was able to select between two models to forecast stationary time series and another to rank three models used to forecast time series. Results of both systems were satisfactory. Wang et al. [26] addressed the same problem but with a descriptive MtL approach. Their goal was to extract useful rules with metaknowledge that could aid users in selecting the best forecasting method for a given time series and in developing a strategy to combine the forecasts. Lemke and Gabrys [9] published a similar study but with more emphasis on improving forecasts through model selection and combination.

MtL has also been used to tune the parameters of learning algorithms. Soares et al. [22] proposed a method that, using mainly simple, statistical and information-theoretic metafeatures, was able to successfully predict the width of the Gaussian kernel in Support Vector Regression. Results show that the methodology can select settings with low error while providing significant savings in time. Ali and Smith-Miles [1] published an MtL method to automatically select the kernel of a Support Vector Machine in a classification context, reporting results with high accuracy. Reif et al. [17] used an MtL system to provide good starting points to a genetic algorithm that optimizes the parameters of a Support Vector Machine and a Random Forest classifier. Results attest the effectiveness of the approach.

Data stream mining can also benefit from MtL, especially in a context where the distribution underlying the observations may change over time. Gama and Kosina [5] proposed a metalearning framework that is able to detect the recurrence of contexts and use previously learned models. Their approach differs from the typical MtL approach in the sense that it uses the base-level features to train the metamodel. On the other hand, Rossi et al. [19] reported a system for periodic algorithm selection that uses data characteristics to induce the metamodel (all the metafeatures are of the simple, statistical and information-theoretic type).

Another interesting application of MtL is to use it as a methodology to investigate the reasons behind the success or failure of a learning algorithm [25]. In this approach, instead of the typical predictive methodology, MtL is used to study the relation between the generated metafeatures and a metatarget that represents the base-level phenomenon that one wishes to understand. Kalousis et al. [6] published a paper on this matter. They address the problem of discovering similarities among classification algorithms and among datasets using simple, statistical and information-theoretic metafeatures.

All the MtL applications mentioned above use different sets of metafeatures. It is mandatory to adapt the set of metafeatures to the problem domain. However, as stated previously, we believe that it would be useful to decompose all these metafeatures into a common framework. Furthermore, such a framework must also help the MtL user in the development of new metafeatures.
3 Metafeatures Development Framework

In this section, we propose a framework that allows a more systematized and standardized development of metafeatures for MtL problems. This framework splits the conception of a metafeature into four components: object, role, data domain and aggregation function. Within each component, the metafeature can be generated by using different subcomponents. Figure 2 illustrates the framework.

The object component concerns which information is going to be used to compute the metafeature. It can be one or more instances, datasets, models or predictions. The metafeature can extract information from one subcomponent (e.g., the class entropy of a dataset), from several units of a subcomponent (e.g., the mean class entropy of a subset of datasets) and, for some problems, it might be useful to select multiple subcomponents (e.g., for dynamic selection of models, one could relate instances with models [11]).

The role component details the function of the object component that is going to be used to generate the metafeature. The focus can be on the target variable, predicted or observed, on a feature or on the structure of the object component (e.g., a decision tree model or the representation of a dataset as a graph). Several elements can be selected, e.g., the metafeature can relate the target variable with one or more features.

The third component defines the data domain of the metafeature and is decomposed into four subcomponents: quantitative, qualitative, mixed or complex. This component is highly dependent on the previous ones and influences the metric used for computation (e.g., if the data domain is qualitative, the user cannot use correlation to capture the information). A metric can be quantitative (if the object component is numerical), qualitative (if the object component is categorical), mixed (if the object component has both numerical and categorical data) or complex (in special situations in which the object is a graph or a model).

Finally, there is the aggregation function component. Typically, aggregation is accomplished by some descriptive statistic, e.g., mean, standard deviation, mode, etc. However, for some MtL problems it might be useful not to aggregate the information computed with the metric. This is particularly frequent in MtL applications such as time series or data streams [20], where the data has the same morphology. For example, instead of computing the mean of the correlation between pairs of numerical attributes, one could use the correlations between all pairs of numerical attributes.

Figure 2. Metafeatures Development Framework. (Components and subcomponents: object: instance(s), dataset(s), model(s) or prediction(s); role: target (observed or predicted), feature or structure; data domain: quantitative, qualitative, mixed or complex; aggregation function: descriptive statistic or non-aggregated.)
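To make the decomposition concrete, the following sketch (our own illustration in Python; the component names follow Figure 2, while the class and helper names are assumptions) represents a metafeature as the combination of an object selector, a role selector, a metric suited to the data domain and an optional aggregation function. It is instantiated for the absolute mean correlation between the numeric attributes of a dataset.

from dataclasses import dataclass
from typing import Callable, Optional
import numpy as np

@dataclass
class Metafeature:
    name: str
    object_selector: Callable    # which information is used (dataset, model, predictions, ...)
    role_selector: Callable      # which part of the object (target, features, structure)
    metric: Callable             # metric compatible with the data domain (quantitative here)
    aggregation: Optional[Callable] = None   # None means non-aggregated output

    def compute(self, obj):
        values = self.metric(self.role_selector(self.object_selector(obj)))
        return self.aggregation(values) if self.aggregation else values

def pairwise_abs_correlations(X: np.ndarray) -> np.ndarray:
    # Absolute correlations between all pairs of numeric attributes.
    corr = np.corrcoef(X, rowvar=False)
    return np.abs(corr[np.triu_indices_from(corr, k=1)])

abs_mean_corr = Metafeature(
    name="absolute mean correlation between numeric attributes",
    object_selector=lambda dataset: dataset,        # object: dataset
    role_selector=lambda dataset: dataset["X"],     # role: features
    metric=pairwise_abs_correlations,               # data domain: quantitative
    aggregation=np.mean,                            # aggregation: descriptive statistic
)

rng = np.random.default_rng(0)
dataset = {"X": rng.normal(size=(100, 4)), "y": rng.normal(size=100)}
print(abs_mean_corr.compute(dataset))

Setting aggregation to None yields the non-aggregated variant discussed above for data streams.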
4 Decomposing Metafeatures

We used the framework to decompose metafeatures proposed in several applications in order to assess its applicability and consistency. We show examples from the three types of state-of-the-art metafeatures: simple, statistical and information-theoretic; model-based; and landmarkers.

Figure 3 illustrates the decomposition of six simple, statistical and information-theoretic metafeatures. The first three (number of examples, class entropy and absolute mean correlation between numeric attributes) are common metafeatures used in several published papers [3][6][22]. The framework allows us to detail the computation of each metafeature. Furthermore, it allows the comparison of two or more metafeatures. For example, the absolute mean correlation between numeric attributes is very similar to the correlation between numeric attributes (used in data stream applications [20]) except for the aggregation function. In this case, the application domain makes it feasible and potentially more informative not to aggregate the correlation values.

Figure 3. Simple, statistical and information-theoretic metafeatures decomposed using our framework.

Still regarding Figure 3, the decomposition of the two last metafeatures shows that it is possible to use the framework for more complex data characteristics. Morais and Prati [12] published a paper in which they use measures from complex network theory to characterize a dataset. Their approach consists of transforming the dataset into a graph by means of the similarity between instances. Then, they compute typical measures such as the number of nodes or the average degree. Another example is the Jensen-Shannon distance between a dataset and a bootstrap sample of it [15]. In this example, the authors used the Jensen-Shannon distance to measure the differences caused by the bootstrapping process in the distribution of the variables (features and target).
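As an illustration of this last metafeature, a minimal sketch in Python with NumPy and SciPy follows. The synthetic variable, the number of histogram bins and the base-2 logarithm are our own assumptions, not the exact procedure of [15].

import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(2)
x = rng.normal(size=500)                         # one numeric variable of the dataset
boot = rng.choice(x, size=x.size, replace=True)  # bootstrap sample of that variable

# Discretize both samples with a common binning to obtain two distributions.
bins = np.histogram_bin_edges(x, bins=20)
p, _ = np.histogram(x, bins=bins, density=True)
q, _ = np.histogram(boot, bins=bins, density=True)

# Object: dataset(s); role: feature; data domain: quantitative.
js_distance = jensenshannon(p, q, base=2)
print(js_distance)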
In Figure 4, we show an example of a model-based metafeature decomposed using our framework. For computing the number of nodes of a decision tree, the object component is the model, with particular focus on its structure (as the role component). Peng et al. [13] published a paper in which several model-based metafeatures are proposed (for decision tree models).

Figure 4. Model-based metafeatures decomposed using our framework.

Finally, in Figure 5, we show the framework applied to landmarkers. The first example, the decision stump landmarker [4], uses as object a set of predictions, both the predicted and the observed. Assuming a 0-1 loss function for classification problems, the data domain of a decision stump landmarker is always quantitative. Last but not least, the aggregation function in this case is a descriptive statistic, usually a mean. The second example concerns the metafeatures used in the meta decision trees proposed by Todorovski and Džeroski [24]. The authors used the class probabilities of the base-level classifiers as metafeatures, particularly the highest class probability of a classifier for a single instance.

Figure 5. Landmarker metafeatures decomposed using our framework.
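A minimal sketch of one model-based metafeature and one landmarker of the kinds shown in Figures 4 and 5 follows (in Python with scikit-learn; the dataset, the algorithms and the cross-validation setup are our own assumptions, chosen only for illustration).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Model-based metafeature: number of nodes of a decision tree induced on the dataset
# (object: model; role: structure).
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
n_nodes = tree.tree_.node_count

# Landmarker: quick performance estimate of a decision stump
# (object: predictions, predicted and observed; aggregation: mean accuracy).
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_landmarker = cross_val_score(stump, X, y, cv=10).mean()

print(n_nodes, stump_landmarker)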
5 Developing Metafeatures

In this section we present a case study of the proposed framework with a metric widely used in MtL problems [2]: the correlation between numeric variables. We show that it is possible to generate new metafeatures by combining elements of different components of the framework. Furthermore, using such a framework allows systematic reasoning in the process of developing metafeatures for a given problem. It becomes easier to detect gaps of unmeasured information in a set of metafeatures if a theoretical framework is available that can guide the user by pointing to new research directions.

As mentioned previously, we use the correlation between numeric variables as an example in the context of an MtL application for regression algorithm selection [7]. This is a problem addressed in a relatively small number of papers in comparison with the classification scenario.

Figure 6 shows an illustration of four metafeatures that use the correlation between numeric variables. The first metafeature, the distribution of the correlation between numeric features and the target, although present in the literature [2], differs from the absolute mean correlation between numeric features presented in Figure 3 by adding the element target to the role component. This simple change completely transforms the nature of the metafeature in the sense that, instead of being a metric of redundancy, it is a metric of information. The greater the correlation between a numeric feature and the target, the more informative that feature can be. Furthermore, it can be more useful to use a specific descriptive statistic (maximum, minimum, etc.) instead of the typical mean.

Similarly, the correlation between numeric features and the target has the same purpose as the distribution of the correlation between numeric features and the target, but it is indicated for MtL problems in which the base-level data has the same morphology (as in the data streams scenario [20]). The output of the metafeature is the correlation between the target and each numeric feature.

The two last metafeatures presented in Figure 6 (the correlation between predictions and the target, and the distribution of the correlation between numeric features and the target of instances) were developed using our framework by changing elements of specific components. The correlation between numeric predictions and the target is another form of landmarker in which, instead of using a typical error measure such as RMSE, one uses correlation to assess the similarity between the real values and the predicted ones. In terms of the framework decomposition, this metafeature differs from the typical landmarkers in the aggregation function component. Although we have not yet performed experiments on the usefulness of this metafeature, it is proposed here to exemplify the applicability of the framework to uncover new metafeatures for a given problem.

Finally, the distribution of the correlation between numeric features and the target of instances can be particularly useful for dynamic selection of algorithms/models in a regression scenario [18][10]. If the MtL problem concerns the selection of an algorithm for each instance of the test set (instead of an algorithm for a dataset), it could be useful to collect information that relates instances. This metafeature would allow measuring the correlation between the numeric variables of the instances. Once again, to the best of our knowledge, there are no reported experiments on the dynamic selection of algorithms using MtL. This metafeature is proposed here as another example of the metafeatures that can be developed using correlation as metric.

Figure 6. Examples of correlation metafeatures developed using the proposed framework:
- Distribution of correlation between numeric features and target: object = dataset; role = target (observed), feature; data domain = quantitative; aggregation = descriptive statistic.
- Correlation between numeric features and target: object = dataset; role = target (observed), feature; data domain = quantitative; aggregation = non-aggregated.
- Correlation between predictions and target: object = predictions; role = target (observed), target (predicted); data domain = quantitative; aggregation = non-aggregated.
- Distribution of correlation between numeric features and target of instances: object = instance; role = target (observed), feature; data domain = quantitative; aggregation = descriptive statistic.
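The following sketch illustrates how these correlation-based metafeatures could be computed for a regression dataset (a Python illustration under our own assumptions, using a synthetic dataset and a linear model as the landmarking learner; it is not an implementation evaluated in this paper).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.8, -0.5, 0.0, 0.3]) + rng.normal(scale=0.5, size=300)

# Correlation between each numeric feature and the target (non-aggregated).
feature_target_corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Distribution of the correlation between numeric features and the target,
# aggregated with descriptive statistics such as mean, maximum and minimum.
corr_stats = {"mean": feature_target_corr.mean(),
              "max": feature_target_corr.max(),
              "min": feature_target_corr.min()}

# Correlation between predictions and the target: a landmarker-like metafeature
# that uses correlation instead of an error measure such as RMSE.
preds = cross_val_predict(LinearRegression(), X, y, cv=5)
pred_target_corr = np.corrcoef(preds, y)[0, 1]

print(feature_target_corr, corr_stats, pred_target_corr)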
6 Final Remarks and Future Work

This paper proposes a framework to decompose and develop new metafeatures for MtL problems. We believe that such a framework can assist MtL researchers and users by standardizing the concept of metafeature.

We presented the framework and used it to analyze several metafeatures proposed in the literature for a wide range of MtL scenarios. This process allowed us to validate the usefulness of the framework by distinguishing several state-of-the-art metafeatures. We also provide insights on how the framework can be used to develop new metafeatures for algorithm recommendation in a regression scenario. We use the correlation between numeric variables to exemplify the applicability of the framework.

As for future work, we plan to use this framework to generate new metafeatures for algorithm recommendation in a classification scenario and to empirically validate the framework. Furthermore, we also plan to use the framework in MtL problems that we have been working on, particularly MtL for the pruning of bagging ensembles and the dynamic integration of models.

Acknowledgements

This work is partially funded by FCT/MEC through PIDDAC and ERDF/ON2 within project NORTE-07-0124-FEDER-000059, a project financed by the North Portugal Regional Operational Programme (ON.2 O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT).

REFERENCES

[1] Shawkat Ali and Kate A Smith-Miles, 'A meta-learning approach to automatic kernel selection for support vector machines', Neurocomputing, 70(1), 173–186, (2006).
[2] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta, Metalearning: applications to data mining, Springer, 2008.
[3] Pavel B Brazdil, Carlos Soares, and Joaquim Pinto Da Costa, 'Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results', Machine Learning, 50(3), 251–277, (2003).
[4] Johannes Fürnkranz and Johann Petrak, 'An evaluation of landmarking variants', in Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, pp. 57–68, (2001).
[5] João Gama and Petr Kosina, 'Recurrent concepts in data streams classification', Knowledge and Information Systems, 1–19, (2013).
[6] Alexandros Kalousis, João Gama, and Melanie Hilario, 'On data and algorithms: Understanding inductive performance', Machine Learning, 54(3), 275–312, (2004).
[7] Christian Köpf, Charles Taylor, and Jörg Keller, 'Meta-analysis: from data characterisation for meta-learning to meta-regression', in Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP, Citeseer, (2000).
[8] Petr Kuba, Pavel Brazdil, Carlos Soares, and Adam Woznica, 'Exploiting sampling and meta-learning for parameter setting for support vector machines', in Proc. of the Workshop Learning and Data Mining associated with Iberamia 2002, VIII Iberoamerican Conference on Artificial Intelligence, pp. 209–216, Sevilla (Spain), University of Sevilla, (2002).
[9] Christiane Lemke and Bogdan Gabrys, 'Meta-learning for time series forecasting and forecast combination', Neurocomputing, 73(10), 2006–2016, (2010).
[10] João Mendes-Moreira, Alipio Mario Jorge, Carlos Soares, and Jorge Freire de Sousa, 'Ensemble learning: A study on different variants of the dynamic selection approach', in Machine Learning and Data Mining in Pattern Recognition, 191–205, Springer, (2009).
[11] João Mendes-Moreira, Carlos Soares, Alípio Mário Jorge, and Jorge Freire De Sousa, 'Ensemble approaches for regression: A survey', ACM Computing Surveys (CSUR), 45(1), 10, (2012).
[12] Gleison Morais and Ronaldo C Prati, 'Complex network measures for data set characterization', in Intelligent Systems (BRACIS), 2013 Brazilian Conference on, pp. 12–18, IEEE, (2013).
[13] Yonghong Peng, Peter A Flach, Carlos Soares, and Pavel Brazdil, 'Improved dataset characterisation for meta-learning', in Discovery Science, pp. 141–152, Springer, (2002).
[14] Bernhard Pfahringer, Hilan Bensusan, and Christophe Giraud-Carrier, 'Tell me who can learn you and I can tell you who you are: Landmarking various learning algorithms', in Proceedings of the 17th International Conference on Machine Learning, pp. 743–750, (2000).
[15] Fábio Pinto, Carlos Soares, and João Mendes-Moreira, 'An empirical methodology to analyze the behavior of bagging', submitted for publication, (2014).
[16] Ricardo BC Prudêncio and Teresa B Ludermir, 'Meta-learning approaches to selecting time series models', Neurocomputing, 61, 121–137, (2004).
[17] Matthias Reif, Faisal Shafait, and Andreas Dengel, 'Meta-learning for evolutionary parameter optimization of classifiers', Machine Learning, 87(3), 357–380, (2012).
[18] Niall Rooney, David Patterson, Sarab Anand, and Alexey Tsymbal, 'Dynamic integration of regression models', in Multiple Classifier Systems, 164–173, Springer, (2004).
[19] André Luis Debiaso Rossi, ACPLF Carvalho, and Carlos Soares, 'Meta-learning for periodic algorithm selection in time-changing data', in Neural Networks (SBRN), 2012 Brazilian Symposium on, pp. 7–12, IEEE, (2012).
[20] André Luis Debiaso Rossi, André Carlos Ponce De Leon Ferreira De Carvalho, Carlos Soares, and Bruno Feres De Souza, 'MetaStream: A meta-learning based method for periodic algorithm selection in time-changing data', Neurocomputing, 127, 52–64, (2014).
[21] Floarea Serban, Joaquin Vanschoren, Jörg-Uwe Kietz, and Abraham Bernstein, 'A survey of intelligent assistants for data analysis', ACM Computing Surveys (CSUR), 45(3), 31, (2013).
[22] Carlos Soares, Pavel B Brazdil, and Petr Kuba, 'A meta-learning method to select the kernel width in support vector regression', Machine Learning, 54(3), 195–209, (2004).
[23] Quan Sun and Bernhard Pfahringer, 'Pairwise meta-rules for better meta-learning-based algorithm ranking', Machine Learning, 93(1), 141–161, (2013).
[24] Ljupčo Todorovski and Sašo Džeroski, 'Combining classifiers with meta decision trees', Machine Learning, 50(3), 223–249, (2003).
[25] Joaquin Vanschoren and Hendrik Blockeel, 'Towards understanding learning behavior', in Proceedings of the Annual Machine Learning Conference of Belgium and the Netherlands, pp. 89–96, (2006).
[26] Xiaozhe Wang, Kate Smith-Miles, and Rob Hyndman, 'Rule induction for forecasting method selection: Meta-learning the characteristics of univariate time series', Neurocomputing, 72(10), 2581–2594, (2009).