Predicting the Accuracy of Regression Models in the Retail Industry¹

Fábio Pinto² and Carlos Soares³

Abstract. Companies are moving from developing a single model for a problem (e.g., a regression model to predict general sales) to developing several models for sub-problems of the original problem (e.g., regression models to predict sales of each of its product categories). Given the similarity between the sub-problems, the process of model development should not be independent. Information should be shared between processes. Different approaches can be used for that purpose, including metalearning (MtL) and transfer learning. In this work, we use MtL to predict the performance of a model based on the performance of models that were previously developed. Given that the sub-problems are related (e.g., the schemas of the data are the same), domain knowledge is used to develop the metafeatures that characterize them. The approach is applied to the development of models to predict sales of different product categories in a retail company from Portugal.

1 Introduction

The retail industry is a world of extreme competitiveness. Companies struggle on a daily basis for the loyalty of their clients through diverse marketing actions, while providing better products, better prices and better services. The growing need for analytic tools that enhance retailers' performance is unquestionable, and Data Mining (DM) is central in this trend [2].

Sales prediction is one of the main tasks in retail. The ability to assess the impact that a sudden change in a particular factor will have on the sales of one or more products is a major tool for retailers. DM is one of the approaches for this task.

In early approaches to predict sales, a single model could be used for a whole business. As more detailed data becomes available, retail companies are dividing the problem into several sub-problems (e.g., predicting the sales of each of their stores or product categories). The same trend can be observed in several industries [6].

In this approach, there are obviously many similarities between the sub-problems. Not only is the structure of the data typically the same across all sub-problems (e.g., the variables are the same and their domains are similar) but also the patterns in that data may have similarities (e.g., the most important variables across different problems are also similar). Therefore, the process of model building should not be independent. The knowledge obtained from generating a model for one sub-problem can and should be applied to the process of developing the model for the other sub-problems. Different approaches can be used for that purpose, two of them being MtL and transfer learning [1].

Our goals with this work are to use metalearning (MtL) to predict the performance of one model based on the performance of models that were previously developed to predict sales of product categories in a retail company in Portugal, and to unveil the attributes that are more important for that prediction.

The paper is organised as follows. In Section 2, we briefly survey the concept of MtL and the importance of metafeatures. Our case study is presented in Section 3. Finally, in Section 4 we present some conclusions.

2 Metafeatures for Metalearning

MtL can be defined as the use of data about the performance of machine learning algorithms on previous problems to predict their performance on future ones [1]. For more information on MtL, we refer the reader to [5, 1].

One of the essential issues in MtL is the choice of metafeatures that characterize the problem. Which metafeatures contain useful information to predict the performance of an algorithm on a given problem? Much work has been done on this topic (e.g., [3]). Typically the work on MtL includes problems from different domains, so the metafeatures need to be very generic (e.g., number of attributes and mutual information between symbolic attributes and the target). However, in more specific settings, metafeatures should encode more particular information about the data, which probably contains useful information about the performance of the algorithms.

3 Case Study

The base-level data used in this study was collected to model monthly sales by product category in a Portuguese retail company. We also gathered nine variables that describe store layout, store profile, client profile and seasonality. Six regression methods from R packages were tested: Cubist, NN, SVM, Generalized Boosted Regression, MARS and Random Forests (RF). The DM algorithm with the most robust performance was RF.⁴ The models for the 89 categories were evaluated using the mean percentage error (Eq. 1), where $f_i$ is the predicted value and $a_i$ the real value.

\[
\mathrm{MPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{f_i - a_i}{a_i} \right| \qquad (1)
\]

The estimates were obtained using a sliding window approach where the base-level data spans two years. For the majority of the models, one and a half years (approximately 75% of the data) was used as training set and the remaining half year was used as test set. For categories containing just one year of data, the first 9 months were used as training set and the remaining instances as test set. The results are summarized in Table 1.

Table 1. Summary of base-level results (MPE)

Min.    1st Q.  Median  Mean    3rd Q.  Max.
0.072   0.091   0.116   0.212   0.211   1.209

Modelling the variance in results across different product categories is important not only to predict the performance of models but also to understand it. A better understanding of the factors affecting the performance of the algorithm may lead us to better results. For that purpose, we used a MtL approach with the following problem-specific metafeatures:

• Number of instances
• Type of sliding window
• Variables that capture the amount of variation in store layout
• Variables that capture the diversity of store profile in the data
• Variables that capture the amplitude⁵ of sales in the test set
• Variables that capture the amplitude of store layout attributes in the training and test sets

The metadata contained 89 examples (corresponding to the 89 categories) described by 14 predictors. The meta-level error of RF for regression was estimated using 10-fold cross-validation.⁶ The number of trees was set to 500, as improvement in performance was not found for larger values; and three values were tested for the mtry parameter, which controls the number of variables randomly sampled as candidates at each split [4]. The results are summarized in Table 2.

Table 2. Meta-level results obtained with all metafeatures in terms of Relative Mean Squared Error and the R² and their standard-deviations (SD).

mtry  RMSE   R²     RMSE SD  R² SD
2     0.148  0.66   0.0847   0.215
8     0.146  0.663  0.0877   0.234
14    0.15   0.65   0.0871   0.224

The best results were obtained with the mtry parameter set to 8. Even with a standard deviation of 0.23, a value of 0.66 for R² gives us confidence about the capacity of the regression metamodel in predicting the performance of future RF models.

The next step is to identify the metafeatures that contributed the most to this result. The RF implementation that we used includes a function to measure the importance of predictors in the classification/regression model. We applied the algorithm on all 89 instances. The R² obtained was 0.93. The importance of the variables for this model is summarized in Table 3.

Table 3. Importance of Variables.

Rank  Variable
1st   Max(sales)/Min(sales) in training set
2nd   Mean number of changes in shelf size of category
3rd   Number of instances

These results show evidence that the metafeatures that represent the amplitude of the dependent variable in the training set and the amount of variance in data are very informative for predicting the accuracy of regression models.

Finally, we used these results for meta-level variable selection. We re-executed the meta-level experiments, again estimating the performance of the RF for regression with 10-fold cross-validation, with only the three most informative metafeatures. The results are summarized in Table 4.

Table 4. Meta-level results obtained with three selected metafeatures in terms of Relative Mean Squared Error and the R² and their standard-deviations (SD).

mtry  RMSE   R²     RMSE SD  R² SD
2     0.138  0.678  0.0552   0.234
3     0.14   0.687  0.0571   0.23

The results shown in Table 4 are even better than those obtained previously. However, they must be interpreted carefully, as the same dataset was used to do metafeature selection and test its effectiveness, thus increasing the potential for overfitting. Nevertheless, this evidence leads us to believe that some of the metafeatures used before were carrying noise and that results can improve by metafeature selection.

4 Conclusions

We successfully used MtL with domain-specific metafeatures to predict the accuracy of regression models in predicting sales by product category in the retail industry. Our work shows evidence that, when possible, using domain knowledge to design metafeatures is advantageous.

We plan to extend this approach to predict the performance of multiple algorithms. Additionally, we will compare these results with the results of MtL using traditional metafeatures.

¹ This work is partially funded by the ERDF (European Regional Development Fund) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT, Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology), within project “FCOMP - 01-0124-FEDER-022701”.
² Faculdade de Economia, Universidade do Porto, Portugal, email: fabiohscpinto@gmail.com
³ INESC TEC/Faculdade de Economia, Universidade do Porto, Portugal, email: csoares@fep.up.pt
⁴ The randomForest package was used to fit RF models.
⁵ We calculate “amplitude” of a variable by dividing the largest value by the smallest.
⁶ The package caret for R was used for 10-fold cross validation estimation. Size of each sample: 81, 80, 80, 81, 80, 79, 80, 80, 79 and 81.

REFERENCES

[1] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta, Metalearning: Applications to Data Mining, Cognitive Technologies, Springer, Berlin, Heidelberg, 2009.
[2] Thomas H Davenport, ‘Realizing the Potential of Retail Analytics’, Working Knowledge Research Report, (2009).
[3] Alexandros Kalousis, João Gama, and Melanie Hilario, ‘On Data and Algorithms: Understanding Inductive Performance’, Machine Learning, 54(3), 275–312, (2004).
[4] A Liaw and M Wiener, ‘Classification and Regression by randomForest’, R News, II/III, 18–22, (2002).
[5] F. Serban, J. Vanschoren, J.U. Kietz, and A. Bernstein, ‘A survey of intelligent assistants for data analysis’, ACM Computing Surveys, in press, (2012).
[6] Françoise Soulié-Fogelman, ‘Data Mining in the real world: What do we need and what do we have?’, in Proceedings of the KDD Workshop on Data Mining for Business Applications, eds., R Ghani and C Soares, pp. 44–48, (2006).
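
As a supplementary illustration of the mean percentage error (Eq. 1) and of the “amplitude” defined in footnote 5, the following minimal Python sketch shows how both quantities could be computed for one product category. It is not the authors' R code: the function names and the sales figures are hypothetical and serve only to make the two definitions concrete.

```python
def mean_percentage_error(predicted, actual):
    """Mean of |f_i - a_i| / |a_i| over all instances (Eq. 1).

    `predicted` holds the f_i values and `actual` the a_i values.
    Actual values must be non-zero, since each error is divided by a_i.
    """
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs((f - a) / a) for f, a in zip(predicted, actual))
    return total / len(actual)


def amplitude(values):
    """Largest value divided by the smallest (footnote 5)."""
    return max(values) / min(values)


# Hypothetical monthly sales of one category (illustrative numbers only).
actual_sales = [100.0, 100.0]
predicted_sales = [110.0, 90.0]

mpe = mean_percentage_error(predicted_sales, actual_sales)
sales_amplitude = amplitude([50.0, 200.0, 100.0])
```

In this toy case each prediction is off by 10% of the real value, so the MPE is 0.1, and the amplitude of the second series is 200/50 = 4. In the setting of the paper, the first quantity would play the role of the meta-level target and the second the role of one metafeature.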