A Framework To Decompose And Develop Metafeatures

Fábio Pinto (LIAAD-INESC TEC, Universidade do Porto, Portugal, e-mail: fhpinto@inesctec.pt), Carlos Soares (CESE-INESC TEC, Universidade do Porto, Portugal, e-mail: csoares@fe.up.pt) and João Mendes-Moreira (LIAAD-INESC TEC, Universidade do Porto, Portugal, e-mail: jmoreira@fe.up.pt)

Abstract. This paper proposes a framework to decompose and develop metafeatures for Metalearning (MtL) problems. Several metafeatures (also known as data characteristics) have been proposed in the literature for a wide range of problems. Since MtL applicability is very general but problem dependent, researchers focus on generating specific and yet informative metafeatures for each problem. This process is carried out without any sort of conceptual framework. We believe that such a framework would open new horizons for the development of metafeatures and also aid the process of understanding the metafeatures already proposed in the state of the art. We propose a framework with the aim of filling that gap and we show its applicability in a scenario of algorithm recommendation for regression problems.

1 Introduction

Researchers have been using MtL to overcome numerous challenges faced by data mining practitioners, such as algorithm selection [3][23], time series forecasting [9], data streams [19][20][5], parameter tuning [22] or understanding of learning behavior [6].

As the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [2], MtL is used to extrapolate knowledge gained in previous experiments to better manage new problems. That knowledge is stored as metadata, particularly metafeatures and the metatarget, as outlined in Figure 1. The metafeatures (extracted from A to B and stored in F) consist of data characteristics that relate the learning algorithms and the data under analysis, e.g., the correlation between numeric attributes of a dataset. The metatarget (extracted through C-D-E and stored in F) represents the meta-variable that one wishes to understand or predict, e.g., the algorithm with the best performance for a given dataset.

Figure 1. Metalearning: knowledge acquisition. Adapted from [2]. (Diagram relating a Dataset, its Metafeatures and the resulting Metadata with the steps Choose Learning Techniques, Learning Strategy and Performance Evaluation, labelled A to F.)

Independently of the problem at hand, the main issue in MtL concerns defining the metafeatures. If the user is able to generate informative metafeatures, it is very likely that the application of MtL is going to be successful. The state of the art shows that there are three types of metafeatures: 1) simple, statistical and information-theoretic metafeatures. In this group we can find the number of examples of the dataset, the correlation between numeric attributes or the class entropy, to name a few. The application of this kind of metafeatures provides not only informative metafeatures but also interpretable knowledge about the problems [3]; 2) model-based metafeatures [13]. These capture some characteristic of a model generated by applying a learning algorithm to a dataset, e.g., the number of leaf nodes of a decision tree. Finally, a metafeature can also be 3) a landmarker [14]. These are generated by making a quick performance estimate of a learning algorithm on a particular dataset.

Although the state of the art proposes several metafeatures of all types for a wide range of problems, we argue that the literature lacks a unifying framework to categorize and develop new metafeatures. Such a framework could help MtL users by systematizing the process of generating new metafeatures. Furthermore, the framework could be very useful to compare different metafeatures and to assess whether there is overlap in the information that they capture. In this paper, we propose a framework with that purpose and we use it in the analysis of the metafeatures used in several MtL applications. We also show its applicability to generate metafeatures in a scenario of algorithm recommendation for regression problems.

The paper is organized as follows. In Section 2 we present a brief overview of MtL applications and the respective metafeatures. Section 3 details the proposed framework to decompose and develop metafeatures. In Section 4 we use the framework to decompose and understand metafeatures already proposed in the literature. Section 5 exemplifies how the framework can be used to develop new metafeatures in a scenario of algorithm recommendation for regression problems. Finally, we conclude the paper with some final remarks and future work.
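To make the knowledge-acquisition process of Figure 1 concrete, the following sketch (in Python with scikit-learn; a minimal illustration under our own assumptions, where the helper names and the choice of datasets and algorithms are ours and not those of any of the cited systems) computes two simple metafeatures for each dataset, obtains the metatarget as the algorithm with the best cross-validated accuracy, and stores both as metadata from which a metamodel could later be induced.

import numpy as np
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def metafeatures(X, y):
    # Two simple metafeatures: number of examples and class entropy.
    p = np.bincount(y) / len(y)
    class_entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {"n_examples": X.shape[0], "class_entropy": class_entropy}

def metatarget(X, y, algorithms):
    # Metatarget: the algorithm with the best cross-validated accuracy.
    scores = {name: cross_val_score(clf, X, y, cv=5).mean()
              for name, clf in algorithms.items()}
    return max(scores, key=scores.get)

algorithms = {"tree": DecisionTreeClassifier(random_state=0), "nb": GaussianNB()}
metadata = []
for loader in (load_iris, load_wine):   # stand-ins for a repository of datasets
    X, y = loader(return_X_y=True)
    metadata.append((metafeatures(X, y), metatarget(X, y, algorithms)))
print(metadata)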
2 Metalearning

MtL emerges as the most promising answer of machine learning researchers to the need for an intelligent assistant for data analysis [21]. Since the majority of data mining processes include several non-trivial decisions, it would be useful to have a system that could guide users in analyzing their data.

The main focus of MtL research has been the problem of algorithm recommendation. Several works proposed systems in which data characteristics were related with the performance of learning algorithms on different datasets. The system of Brazdil et al. [3] provides recommendations in the form of rankings of learning algorithms. Besides the MtL system, they also proposed an evaluation methodology for ranking problems that is useful for the problem of algorithm ranking. Sun and Pfahringer [23] extended the work of Brazdil et al. with two main contributions: pairwise meta-rules, generated by comparing the performance of individual base learners in a one-against-one manner, and a new meta-learner for ranking algorithms.

Another problem addressed by MtL has been the selection of the best method for time series forecasting. The first attempt was carried out by Prudêncio and Ludermir [16] with two different systems: one that was able to select between two models to forecast stationary time series and another to rank three models used to forecast time series. Results of both systems were satisfactory. Wang et al. [26] addressed the same problem but with a descriptive MtL approach. Their goal was to extract useful rules with metaknowledge that could aid users in selecting the best forecasting method for a given time series and in developing a strategy to combine the forecasts. Lemke and Gabrys [9] published a similar study but with more emphasis on improving forecasts through model selection and combination.

MtL has also been used to tune the parameters of learning algorithms. Soares et al. [22] proposed a method that, using mainly simple, statistical and information-theoretic metafeatures, was able to successfully predict the width of the Gaussian kernel in Support Vector Regression. Results show that the methodology can select settings with low error while providing significant savings in time. Ali and Smith-Miles [1] published an MtL method to automatically select the kernel of a Support Vector Machine in a classification context, reporting results with high accuracy. Reif et al. [17] used an MtL system to provide good starting points to a genetic algorithm that optimizes the parameters of a Support Vector Machine and a Random Forest classifier. Results attest the effectiveness of the approach.

Data stream mining can also benefit from MtL, especially in a context where the distribution underlying the observations may change over time. Gama and Kosina [5] proposed a metalearning framework that is able to detect the recurrence of contexts and use previously learned models. Their approach differs from the typical MtL approach in the sense that it uses the base-level features to train the metamodel. On the other hand, Rossi et al. [19] reported a system for periodic algorithm selection that uses data characteristics to induce the metamodel (all the metafeatures are of the simple, statistical and information-theoretic type).

Another interesting application of MtL is to use it as a methodology to investigate the reasons behind the success or failure of a learning algorithm [25]. In this approach, instead of the typical predictive methodology, MtL is used to study the relation between the generated metafeatures and a metatarget that represents the base-level phenomenon that one wishes to understand. Kalousis et al. [6] published a paper on this matter. They address the problem of discovering similarities among classification algorithms and among datasets using simple, statistical and information-theoretic metafeatures.

All the MtL applications mentioned above use different sets of metafeatures. It is mandatory to adapt the set of metafeatures to the problem domain. However, as stated previously, we believe that it would be useful to decompose all these metafeatures into a common framework. Furthermore, such a framework must also help the MtL user in the development of new metafeatures.
3 Metafeatures Development Framework

In this section, we propose a framework that allows a more systematized and standardized development of metafeatures for MtL problems. This framework splits the conception of a metafeature into four components: object, role, data domain and aggregation function. Within each component, the metafeature can be generated by using different subcomponents. Figure 2 illustrates the framework.

The object component concerns which information is going to be used to compute the metafeature. It can be one or more instances, datasets, models or predictions. The metafeature can extract information from one subcomponent (e.g., the class entropy of a dataset), from several units of a subcomponent (e.g., the mean class entropy of a subset of datasets) and, for some problems, it might be useful to select multiple subcomponents (e.g., for dynamic selection of models, one could relate instances with models [11]).

The role component details the function of the object component that is going to be used to generate the metafeature. The focus can be on the target variable, predicted or observed, on a feature or on the structure of the object component (e.g., a decision tree model or the representation of a dataset as a graph). Several elements can be selected, e.g., the metafeature can relate the target variable with one or more features.

The third component defines the data domain of the metafeature and is decomposed into four subcomponents: quantitative, qualitative, mixed or complex. This component is highly dependent on the previous ones and influences the metric used for computation (e.g., if the data domain is qualitative, the user cannot use correlation to capture the information). A metric can be quantitative (if the object component is numerical), qualitative (if the object component is categorical), mixed (if the object component has both numerical and categorical data) or complex (in special situations in which the object is a graph or a model).

Finally, there is the aggregation function component. Typically, aggregation is accomplished by some descriptive statistic, e.g., mean, standard deviation, mode, etc. However, for some MtL problems it might be useful not to aggregate the information computed with the metric. This is particularly frequent in MtL applications such as time series or data streams [20], where the data has the same morphology. For example, instead of computing the mean of the correlation between pairs of numerical attributes, one could use the correlations between all pairs of numerical attributes.

Figure 2. Metafeatures Development Framework. (Components and subcomponents: object: instance(s), dataset(s), model(s) or prediction(s); role: target (observed or predicted), feature or structure; data domain: quantitative, qualitative, mixed or complex; aggregation function: descriptive statistic or non-aggregated.)
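To make the decomposition concrete, the following sketch (our own illustration in Python; the component names follow Figure 2, while the class and helper names are assumptions) represents a metafeature as the combination of an object selector, a role selector, a metric suited to the data domain and an optional aggregation function. It is instantiated for the absolute mean correlation between the numeric attributes of a dataset.

from dataclasses import dataclass
from typing import Callable, Optional
import numpy as np

@dataclass
class Metafeature:
    name: str
    object_selector: Callable    # which information is used (dataset, model, predictions, ...)
    role_selector: Callable      # which part of the object (target, features, structure)
    metric: Callable             # metric compatible with the data domain (quantitative here)
    aggregation: Optional[Callable] = None   # None means non-aggregated output

    def compute(self, obj):
        values = self.metric(self.role_selector(self.object_selector(obj)))
        return self.aggregation(values) if self.aggregation else values

def pairwise_abs_correlations(X: np.ndarray) -> np.ndarray:
    # Absolute correlations between all pairs of numeric attributes.
    corr = np.corrcoef(X, rowvar=False)
    return np.abs(corr[np.triu_indices_from(corr, k=1)])

abs_mean_corr = Metafeature(
    name="absolute mean correlation between numeric attributes",
    object_selector=lambda dataset: dataset,        # object: dataset
    role_selector=lambda dataset: dataset["X"],     # role: features
    metric=pairwise_abs_correlations,               # data domain: quantitative
    aggregation=np.mean,                            # aggregation: descriptive statistic
)

rng = np.random.default_rng(0)
dataset = {"X": rng.normal(size=(100, 4)), "y": rng.normal(size=100)}
print(abs_mean_corr.compute(dataset))

Setting aggregation to None yields the non-aggregated variant discussed above for data streams.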
4 Decomposing Metafeatures

We used the framework to decompose metafeatures proposed in several applications in order to assess its applicability and consistency. We show examples from the three types of state-of-the-art metafeatures: simple, statistical and information-theoretic; model-based; and landmarkers.

Figure 3 illustrates the decomposition of six simple, statistical and information-theoretic metafeatures. The first three (number of examples, class entropy and absolute mean correlation between numeric attributes) are common metafeatures used in several published papers [3][6][22]. The framework allows us to detail the computation of each metafeature. Furthermore, it allows the comparison of two or more metafeatures. For example, the absolute mean correlation between numeric attributes is very similar to the correlation between numeric attributes (used in data stream applications [20]) except for the aggregation function. In this case, the application domain makes it feasible and potentially more informative not to aggregate the correlation values.

Figure 3. Simple, statistical and information-theoretic metafeatures decomposed using our framework.

Still regarding Figure 3, the decomposition of the two last metafeatures shows that it is possible to use the framework for more complex data characteristics. Morais and Prati [12] published a paper in which they use measures from complex network theory to characterize a dataset. Their approach consists of transforming the dataset into a graph by means of the similarity between instances. Then, they compute typical measures such as the number of nodes or the average degree. Another example is the Jensen-Shannon distance between a dataset and a bootstrap sample of it [15]. In this example, the authors used the Jensen-Shannon distance to measure the differences caused by the bootstrapping process in the distribution of the variables (features and target).
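As an illustration of this last metafeature, a minimal sketch in Python with NumPy and SciPy follows. The synthetic variable, the number of histogram bins and the base-2 logarithm are our own assumptions, not the exact procedure of [15].

import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(2)
x = rng.normal(size=500)                         # one numeric variable of the dataset
boot = rng.choice(x, size=x.size, replace=True)  # bootstrap sample of that variable

# Discretize both samples with a common binning to obtain two distributions.
bins = np.histogram_bin_edges(x, bins=20)
p, _ = np.histogram(x, bins=bins, density=True)
q, _ = np.histogram(boot, bins=bins, density=True)

# Object: dataset(s); role: feature; data domain: quantitative.
js_distance = jensenshannon(p, q, base=2)
print(js_distance)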
In Figure 4, we show an example of a model-based metafeature decomposed using our framework. For computing the number of nodes of a decision tree, the object component is the model, with particular focus on its structure (as the role component). Peng et al. [13] published a paper in which several model-based metafeatures are proposed (for decision tree models).

Figure 4. Model-based metafeatures decomposed using our framework.

Finally, in Figure 5, we show the framework applied to landmarkers. The first example, the decision stump landmarker [4], uses as object a set of predictions, both the predicted and the observed. Assuming a 0-1 loss function for classification problems, the data domain of a decision stump landmarker is always quantitative. Last but not least, the aggregation function in this case is a descriptive statistic, usually a mean. The second example concerns the metafeatures used in the meta decision trees proposed by Todorovski and Džeroski [24]. The authors used the class probabilities of the base-level classifiers as metafeatures, particularly the highest class probability of a classifier for a single instance.

Figure 5. Landmarker metafeatures decomposed using our framework.
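A minimal sketch of one model-based metafeature and one landmarker of the kinds shown in Figures 4 and 5 follows (in Python with scikit-learn; the dataset, the algorithms and the cross-validation setup are our own assumptions, chosen only for illustration).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Model-based metafeature: number of nodes of a decision tree induced on the dataset
# (object: model; role: structure).
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
n_nodes = tree.tree_.node_count

# Landmarker: quick performance estimate of a decision stump
# (object: predictions, predicted and observed; aggregation: mean accuracy).
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_landmarker = cross_val_score(stump, X, y, cv=10).mean()

print(n_nodes, stump_landmarker)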
5 Developing Metafeatures

In this section we present a case study of the proposed framework with a metric widely used in MtL problems [2]: the correlation between numeric variables. We show that it is possible to generate new metafeatures by combining elements of different components of the framework. Furthermore, using such a framework allows systematic reasoning in the process of developing metafeatures for a given problem. It becomes easier to detect gaps of unmeasured information in a set of metafeatures if a theoretical framework is available that can guide the user by pointing to new research directions.

As mentioned previously, we use the correlation between numeric variables as an example in the context of an MtL application for regression algorithm selection [7]. This is a problem addressed in a relatively small number of papers in comparison with the classification scenario.

Figure 6 shows an illustration of four metafeatures that use the correlation between numeric variables. The first metafeature, the distribution of the correlation between numeric features and the target, although present in the literature [2], differs from the absolute mean correlation between numeric features presented in Figure 3 by adding the element target to the role component. This simple change completely transforms the nature of the metafeature in the sense that, instead of being a metric of redundancy, it is a metric of information. The greater the correlation between a numeric feature and the target, the more informative that feature can be. Furthermore, it can be more useful to use a specific descriptive statistic (maximum, minimum, etc.) instead of the typical mean.

Similarly, the correlation between numeric features and the target has the same purpose as the distribution of the correlation between numeric features and the target, but it is indicated for MtL problems in which the base-level data has the same morphology (as in the data streams scenario [20]). The output of the metafeature is the correlation between the target and each numeric feature.

The two last metafeatures presented in Figure 6 (the correlation between predictions and the target, and the distribution of the correlation between numeric features and the target of instances) were developed using our framework by changing elements of specific components. The correlation between numeric predictions and the target is another form of landmarker in which, instead of using a typical error measure such as RMSE, one uses correlation to assess the similarity between the real values and the predicted ones. In terms of the framework decomposition, this metafeature differs from the typical landmarkers in the aggregation function component. Although we have not yet performed experiments on the usefulness of this metafeature, it is proposed here to exemplify the applicability of the framework to uncover new metafeatures for a given problem.

Finally, the distribution of the correlation between numeric features and the target of instances can be particularly useful for dynamic selection of algorithms/models in a regression scenario [18][10]. If the MtL problem concerns the selection of an algorithm for each instance of the test set (instead of an algorithm for a dataset), it could be useful to collect information that relates instances. This metafeature would allow measuring the correlation between the numeric variables of the instances. Once again, to the best of our knowledge, there are no reported experiments on the dynamic selection of algorithms using MtL. This metafeature is proposed here as another example of the metafeatures that can be developed using correlation as metric.

Figure 6. Examples of correlation metafeatures developed using the proposed framework:
- Distribution of correlation between numeric features and target: object = dataset; role = target (observed), feature; data domain = quantitative; aggregation = descriptive statistic.
- Correlation between numeric features and target: object = dataset; role = target (observed), feature; data domain = quantitative; aggregation = non-aggregated.
- Correlation between predictions and target: object = predictions; role = target (observed), target (predicted); data domain = quantitative; aggregation = non-aggregated.
- Distribution of correlation between numeric features and target of instances: object = instance; role = target (observed), feature; data domain = quantitative; aggregation = descriptive statistic.
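The following sketch illustrates how these correlation-based metafeatures could be computed for a regression dataset (a Python illustration under our own assumptions, using a synthetic dataset and a linear model as the landmarking learner; it is not an implementation evaluated in this paper).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.8, -0.5, 0.0, 0.3]) + rng.normal(scale=0.5, size=300)

# Correlation between each numeric feature and the target (non-aggregated).
feature_target_corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Distribution of the correlation between numeric features and the target,
# aggregated with descriptive statistics such as mean, maximum and minimum.
corr_stats = {"mean": feature_target_corr.mean(),
              "max": feature_target_corr.max(),
              "min": feature_target_corr.min()}

# Correlation between predictions and the target: a landmarker-like metafeature
# that uses correlation instead of an error measure such as RMSE.
preds = cross_val_predict(LinearRegression(), X, y, cv=5)
pred_target_corr = np.corrcoef(preds, y)[0, 1]

print(feature_target_corr, corr_stats, pred_target_corr)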
6 Final Remarks and Future Work

This paper proposes a framework to decompose and develop new metafeatures for MtL problems. We believe that such a framework can assist MtL researchers and users by standardizing the concept of metafeature.

We presented the framework and used it to analyze several metafeatures proposed in the literature for a wide range of MtL scenarios. This process allowed us to validate the usefulness of the framework by distinguishing several state-of-the-art metafeatures. We also provide insights on how the framework can be used to develop new metafeatures for algorithm recommendation in a regression scenario. We use the correlation between numeric variables to exemplify the applicability of the framework.

As for future work, we plan to use this framework to generate new metafeatures for algorithm recommendation in a classification scenario and to empirically validate the framework. Furthermore, we also plan to use the framework in MtL problems that we have been working on, particularly MtL for the pruning of bagging ensembles and the dynamic integration of models.

Acknowledgements

This work is partially funded by FCT/MEC through PIDDAC and ERDF/ON2 within project NORTE-07-0124-FEDER-000059, a project financed by the North Portugal Regional Operational Programme (ON.2 O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT).

REFERENCES

[1] Shawkat Ali and Kate A Smith-Miles, 'A meta-learning approach to automatic kernel selection for support vector machines', Neurocomputing, 70(1), 173–186, (2006).
[2] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta, Metalearning: applications to data mining, Springer, 2008.
[3] Pavel B Brazdil, Carlos Soares, and Joaquim Pinto Da Costa, 'Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results', Machine Learning, 50(3), 251–277, (2003).
[4] Johannes Fürnkranz and Johann Petrak, 'An evaluation of landmarking variants', in Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, pp. 57–68, (2001).
[5] João Gama and Petr Kosina, 'Recurrent concepts in data streams classification', Knowledge and Information Systems, 1–19, (2013).
[6] Alexandros Kalousis, João Gama, and Melanie Hilario, 'On data and algorithms: Understanding inductive performance', Machine Learning, 54(3), 275–312, (2004).
[7] Christian Köpf, Charles Taylor, and Jörg Keller, 'Meta-analysis: from data characterisation for meta-learning to meta-regression', in Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP, Citeseer, (2000).
[8] Petr Kuba, Pavel Brazdil, Carlos Soares, and Adam Woznica, 'Exploiting sampling and meta-learning for parameter setting for support vector machines', in Proc. of the Workshop Learning and Data Mining associated with Iberamia 2002, VIII Iberoamerican Conference on Artificial Intelligence, pp. 209–216, Sevilla (Spain), University of Sevilla, (2002).
[9] Christiane Lemke and Bogdan Gabrys, 'Meta-learning for time series forecasting and forecast combination', Neurocomputing, 73(10), 2006–2016, (2010).
[10] João Mendes-Moreira, Alipio Mario Jorge, Carlos Soares, and Jorge Freire de Sousa, 'Ensemble learning: A study on different variants of the dynamic selection approach', in Machine Learning and Data Mining in Pattern Recognition, 191–205, Springer, (2009).
[11] João Mendes-Moreira, Carlos Soares, Alípio Mário Jorge, and Jorge Freire De Sousa, 'Ensemble approaches for regression: A survey', ACM Computing Surveys (CSUR), 45(1), 10, (2012).
[12] Gleison Morais and Ronaldo C Prati, 'Complex network measures for data set characterization', in Intelligent Systems (BRACIS), 2013 Brazilian Conference on, pp. 12–18, IEEE, (2013).
[13] Yonghong Peng, Peter A Flach, Carlos Soares, and Pavel Brazdil, 'Improved dataset characterisation for meta-learning', in Discovery Science, pp. 141–152, Springer, (2002).
[14] Bernhard Pfahringer, Hilan Bensusan, and Christophe Giraud-Carrier, 'Tell me who can learn you and I can tell you who you are: Landmarking various learning algorithms', in Proceedings of the 17th International Conference on Machine Learning, pp. 743–750, (2000).
[15] Fábio Pinto, Carlos Soares, and João Mendes-Moreira, 'An empirical methodology to analyze the behavior of bagging', submitted for publication, (2014).
[16] Ricardo BC Prudêncio and Teresa B Ludermir, 'Meta-learning approaches to selecting time series models', Neurocomputing, 61, 121–137, (2004).
[17] Matthias Reif, Faisal Shafait, and Andreas Dengel, 'Meta-learning for evolutionary parameter optimization of classifiers', Machine Learning, 87(3), 357–380, (2012).
[18] Niall Rooney, David Patterson, Sarab Anand, and Alexey Tsymbal, 'Dynamic integration of regression models', in Multiple Classifier Systems, 164–173, Springer, (2004).
[19] André Luis Debiaso Rossi, ACPLF Carvalho, and Carlos Soares, 'Meta-learning for periodic algorithm selection in time-changing data', in Neural Networks (SBRN), 2012 Brazilian Symposium on, pp. 7–12, IEEE, (2012).
[20] André Luis Debiaso Rossi, André Carlos Ponce De Leon Ferreira De Carvalho, Carlos Soares, and Bruno Feres De Souza, 'MetaStream: A meta-learning based method for periodic algorithm selection in time-changing data', Neurocomputing, 127, 52–64, (2014).
[21] Floarea Serban, Joaquin Vanschoren, Jörg-Uwe Kietz, and Abraham Bernstein, 'A survey of intelligent assistants for data analysis', ACM Computing Surveys (CSUR), 45(3), 31, (2013).
[22] Carlos Soares, Pavel B Brazdil, and Petr Kuba, 'A meta-learning method to select the kernel width in support vector regression', Machine Learning, 54(3), 195–209, (2004).
[23] Quan Sun and Bernhard Pfahringer, 'Pairwise meta-rules for better meta-learning-based algorithm ranking', Machine Learning, 93(1), 141–161, (2013).
[24] Ljupčo Todorovski and Sašo Džeroski, 'Combining classifiers with meta decision trees', Machine Learning, 50(3), 223–249, (2003).
[25] Joaquin Vanschoren and Hendrik Blockeel, 'Towards understanding learning behavior', in Proceedings of the Annual Machine Learning Conference of Belgium and the Netherlands, pp. 89–96, (2006).
[26] Xiaozhe Wang, Kate Smith-Miles, and Rob Hyndman, 'Rule induction for forecasting method selection: Meta-learning the characteristics of univariate time series', Neurocomputing, 72(10), 2581–2594, (2009).