Predicting the Accuracy of Regression Models in the
                         Retail Industry1
                                                      Fábio Pinto2 and Carlos Soares3


Abstract. Companies are moving from developing a single model                 are also similar). Therefore, the process of model building should not
for a problem (e.g., a regression model to predict general sales) to          be independent. The knowledge obtained from generating a model
developing several models for sub-problems of the original problem            for one sub-problem can and should be applied to the process of de-
(e.g., regression models to predict sales of each of its product cate-        veloping the model for the other sub-problems. Different approaches
gories). Given the similarity between the sub-problems, the process           can be used for that purpose, two of them being MtL and transfer
of model development should not be independent. Information                   learning [1].
should be shared between processes. Different approaches can be                  Our goal with this work is to use metalearning (MtL) to predict
used for that purpose, including metalearning (MtL) and transfer              the performance of one model based on the performance of models
learning. In this work, we use MtL to predict the performance of              that were previously developed to predict sales of product categories
a model based on the performance of models that were previously               in a retail company in Portugal, and unveil the attributes that are more
developed. Given that the sub-problems are related (e.g., the schemas         important for that prediction. The paper is organised as follows. In
of the data are the same), domain knowledge is used to develop                Section 2, we briefly survey the concept of MtL and the importance
the metafeatures that characterize them. The approach is applied              of metafeatures. Our case study is presented in Section 3. Finally, in
to the development of models to predict sales of different product            Section 4 we present some conclusions.
categories in a retail company from Portugal.

                                                                              2   Metafeatures for Metalearning
1    Introduction                                                             MtL can be defined as the use of data about the performance of
                                                                              machine learning algorithms on previous problems to predict their
The retail industry is a world of extreme competitiveness. Compa-             performance on future ones [1]. For more information on MtL, we
nies struggle on a daily basis for the loyalty of their clients through       refer the reader to [5, 1].
diverse marketing actions, while providing better products, better               One of the essential issues in MtL is the choice of metafeatures
prices and better services. The growing need for analytic tools that          that characterize the problem. Which metafeatures contain useful
enhance retailers' performance is unquestionable, and Data Mining             information to predict the performance of an algorithm on a given
(DM) is central in this trend [2].                                            problem? Much work has been done on this topic (e.g., [3]). Typi-
   Sales prediction is one of the main tasks in retail. The ability to as-    cally the work on MtL includes problems from different domains, so
sess the impact that a sudden change in a particular factor will have         the metafeatures need to be very generic (e.g., number of attributes
on the sales of one or more products is a major tool for retailers. DM        and mutual information between symbolic attributes and the target).
is one of the approaches for this task.                                       However, in more specific settings, metafeatures should encode
   In early approaches to predict sales, a single model could be used         more particular information about the data, which probably contains
for a whole business. As more detailed data becomes available, retail         useful information about the performance of the algorithms.
companies are dividing the problem into several sub-problems (pre-
dict the sales of each of its stores or product categories). The same
trend can be observed in several industries [6].
   In this approach, there are obviously many similarities between the        3   Case Study
sub-problems. Not only is the structure of the data typically the same
across all sub-problems (e.g., the variables are the same and their           The base-level data used in this study was collected to model
domains are similar) but also the patterns in that data may have              monthly sales by product category in a Portuguese retail company.
similarities (e.g., the most important variables across different problems    We also gathered 9 variables that describe store layout, store profile,
1 This work is partially funded by the ERDF – European Regional Develop-
                                                                              client profile and seasonality. Six regression methods from R
  ment Fund through the COMPETE Programme (operational programme for          packages were tested: Cubist, NN, SVM, Generalized Boosted
  competitiveness) and by National Funds through the FCT – Fundação para    Regression, MARS and Random Forests (RF). The DM algorithm
  a Ciência e a Tecnologia (Portuguese Foundation for Science and Technol-   with the most robust performance was RF.4 The models for the 89
  ogy) within project “FCOMP - 01-0124-FEDER-022701”                          categories were evaluated using the mean percentage error (Eq. 1)
2 Faculdade de Economia, Universidade do Porto, Portugal, email:
  fabiohscpinto@gmail.com                                                     where fi is the predicted value and ai the real value.
3 INESC TEC/Faculdade de Economia, Universidade do Porto, Portugal,
  email: csoares@fep.up.pt                                                    4 The randomForest package was used to fit RF models.
\[ \mathrm{MPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{f_i - a_i}{a_i} \right| \qquad (1) \]

                                                                                               Table 3.   Importance of Variables.
                                                                                          Rank                        Variable
                                                                                           1st           Max(sales)/Min(sales) in training set
The estimates were obtained using a sliding window approach where                          2nd      Mean number of changes in shelf size of category
the base-level data spans two years. For the majority of the models,                       3rd                  Number of instances
one and a half years (approximately 75% of the data) were used as the
training set and the remaining half year was used as test set. For cat-
egories containing just one year of data, the first 9 months were used          The R2 obtained was 0.93. The importance of the variables for this
as training set and the remaining instances as test set. The results are        model is summarized in Table 3. These results show evidence that
summarized in Table 1.                                                          the metafeatures that represent the amplitude of the dependent vari-
                                                                                able in the training set and the amount of variance in data are very
                                                                                informative for predicting the accuracy of regression models.
              Table 1. Summary of base-level results (MPE)                         Finally, we used these results for meta-level variable selection. We
                                                                                re-executed the meta-level experiments, again estimating the perfor-
         Min.      1st Q.    Median      Mean          3rd Q.   Max.            mance of the RF for regression with 10-fold cross-validation, with
         0.072     0.091      0.116      0.212          0.211   1.209
                                                                                only the three most informative metafeatures. The results are sum-
                                                                                marized in Table 4.

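As an illustration of the evaluation described above, the following minimal Python sketch computes the MPE of Eq. 1 on a chronological 75%/25% split. It is not the authors' code (the study used R packages): the function names, the synthetic sales series and the naive forecast are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple


def mean_percentage_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """MPE as in Eq. 1: mean of |(f_i - a_i) / a_i| over the n test instances.

    Raises ZeroDivisionError if any actual value a_i is zero.
    """
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs((f - a) / a) for f, a in zip(predicted, actual)) / len(actual)


def chronological_split(series: List[float], train_fraction: float = 0.75) -> Tuple[List[float], List[float]]:
    """Split a time-ordered series into (train, test) without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical monthly sales for one category over two years (not real data).
monthly_sales = [100.0 + 5.0 * month for month in range(24)]
train, test = chronological_split(monthly_sales)

# Placeholder forecast (assumption): repeat the last training value.
forecast = [train[-1]] * len(test)
mpe = mean_percentage_error(forecast, test)
```

With 24 monthly observations the split yields 18 training months and 6 test months; with 12 observations the same fraction gives the first 9 months for training, matching the setup described for categories with one year of data.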
    Table 4. Meta-level results obtained with three selected metafeatures in
          terms of Relative Mean Squared Error and the R2 and their
                            standard-deviations (SD).

             mtry     RMSE      R2       RMSE SD    R2 SD
              2       0.138     0.678    0.0552     0.234
              3       0.14      0.687    0.0571     0.23

Modelling the variance in results across different product categories
is important not only to predict the performance of models but also to
understand it. A better understanding of the factors affecting the
performance of the algorithm may lead us to better results. For that
purpose, we used an MtL approach with the following problem-specific
metafeatures:
• Number of instances
• Type of sliding window
• Variables that capture the amount of variation in store layout                   The results shown in Table 4 are even better than those obtained
• Variables that capture the diversity of store profile in the data             previously. However, they must be interpreted carefully, as the same
• Variables that capture the amplitude5 of sales in the test set                dataset was used to do metafeature selection and test its effective-
• Variables that capture the amplitude of store layout attributes in            ness, thus increasing the potential for overfitting. Nevertheless, this
  the training and test sets                                                    evidence leads us to believe that some of the metafeatures used before
                                                                                were carrying noise and that results can improve by metafeature se-
The metadata contained 89 examples (corresponding to the 89 cat-                lection.
egories) described by 14 predictors. The meta-level error of RF for
regression was estimated using 10-fold cross-validation.6 The num-
ber of trees was set to 500, as improvement in performance was not
                                                                                4      Conclusions
found for larger values; and 3 values were tested for the mtry                  We successfully used MtL with domain-specific metafeatures to
parameter, which controls the number of variables randomly sampled              predict the accuracy of regression models in predicting sales by product
as candidates at each split [4]. The results are summarized in Table 2.         category in the retail industry. Our work shows evidence that, when
                                                                                possible, using domain knowledge to design metafeatures is advan-
                                                                                tageous.
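To make the problem-specific metafeatures more concrete, the sketch below derives three of them for a single product category: the number of instances, the amplitude of sales (largest value divided by the smallest, as in footnote 5), and a mean number of shelf-size changes. This is only one possible reading, in Python rather than the R setup used in the study; the function names, the input layout and the way changes are counted are assumptions, and the values are invented.

```python
from statistics import mean
from typing import Dict, List, Sequence


def amplitude(values: Sequence[float]) -> float:
    """Largest value divided by the smallest (definition from footnote 5)."""
    smallest, largest = min(values), max(values)
    if smallest == 0:
        raise ValueError("amplitude is undefined when the smallest value is 0")
    return largest / smallest


def category_metafeatures(
    train_sales: Sequence[float],
    shelf_size_by_store: Dict[str, List[int]],
) -> Dict[str, float]:
    """Build a small metafeature record for one product category.

    shelf_size_by_store is a hypothetical layout: store id -> shelf size per period.
    """
    # Assumption: a "change" is any period whose shelf size differs from the previous one.
    changes_per_store = [
        sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
        for sizes in shelf_size_by_store.values()
    ]
    return {
        # Assumption: instances are counted on the training series only.
        "n_instances": len(train_sales),
        "sales_amplitude_train": amplitude(train_sales),
        "mean_shelf_size_changes": mean(changes_per_store),
    }


# Invented example values, for illustration only.
example = category_metafeatures(
    train_sales=[120.0, 80.0, 200.0, 50.0],
    shelf_size_by_store={"store_a": [3, 3, 4, 4], "store_b": [2, 2, 2, 2]},
)
```

Each category would contribute one such record to the metadata, with the MPE of its base-level model as the target of the meta-level regression.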
  Table 2. Meta-level results obtained with all metafeatures in terms of           We plan to extend this approach to predict the performance of mul-
Relative Mean Squared Error and the R2 and their standard-deviations (SD).      tiple algorithms. Additionally, we will compare these results with the
           mtry     RMSE         R2     RMSE SD R2 SD                           results of MtL using traditional metafeatures.
              2     0.148       0.66      0.0847       0.215
              8     0.146      0.663      0.0877       0.234
             14      0.15       0.65      0.0871       0.224                    REFERENCES
                                                                                [1] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo
                                                                                    Vilalta, Metalearning: Applications to Data Mining, Cognitive Tech-
                                                                                    nologies, Springer, Berlin, Heidelberg, 2009.
The best results were obtained with the mtry parameter set to 8. Even
                                                                                [2] Thomas H Davenport, ‘Realizing the Potential of Retail Analytics’,
with a standard deviation of 0.23, a value of 0.66 for R2 gives us con-             Working Knowledge Research Report, (2009).
fidence about the capacity of the regression metamodel in predicting            [3] Alexandros Kalousis, João Gama, and Melanie Hilario, ‘On Data and
the performance of future RF models.                                                Algorithms: Understanding Inductive Performance’, Machine Learning,
    The next step is to identify the metafeatures that contributed the              54(3), 275–312, (2004).
                                                                                [4] A Liaw and M Wiener, ‘Classification and Regression by randomForest’,
most to this result. The RF implementation that we used includes a                  R News, II/III, 18–22, (2002).
function to measure the importance of predictors in the classifica-             [5] F. Serban, J. Vanschoren, J.U. Kietz, and A. Bernstein, ‘A survey of in-
tion/regression model. We applied the algorithm on all 89 instances.                telligent assistants for data analysis’, ACM Computing Surveys, (2012).
                                                                                    in press.
5 We calculate “amplitude” of a variable by dividing the largest value by the
                                                                                [6] Françoise Soulié-Fogelman, ‘Data Mining in the real world: What do we
  smallest.                                                                         need and what do we have?’, in Proceedings of the KDD Workshop on
6 The package caret for R was used for 10-fold cross validation estimation.
                                                                                    Data Mining for Business Applications, eds., R Ghani and C Soares, pp.
  Size of each sample: 81, 80, 80, 81, 80, 79, 80, 80, 79 and 81.                   44–48, (2006).