=Paper=
{{Paper
|id=Vol-2259/aics_20
|storemode=property
|title=A Novel Approach to Recommendation Algorithm Selection using Meta-Learning
|pdfUrl=https://ceur-ws.org/Vol-2259/aics_20.pdf
|volume=Vol-2259
|authors=Andrew Collins,Dominika Tkaczyk,Joeran Beel
|dblpUrl=https://dblp.org/rec/conf/aics/CollinsTB18
}}
==A Novel Approach to Recommendation Algorithm Selection using Meta-Learning==
Andrew Collins, Dominika Tkaczyk and Joeran Beel
ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin, Ireland
ancollin@tcd.ie, d.tkaczyk@gmail.com, Joeran.Beel@adaptcentre.ie

Abstract. The effectiveness of recommendation algorithms is typically assessed with evaluation metrics such as root mean square error, F1, or click-through rates, calculated over entire datasets. The best algorithm is typically chosen based on these overall metrics. However, there is no single-best algorithm for all users, items, and contexts. Choosing a single algorithm based on overall evaluation results is not optimal. In this paper, we propose a meta-learning-based approach to recommendation, which aims to select the best algorithm for each user-item pair. We evaluate our approach using the MovieLens 100K and 1M datasets. Our approach (RMSE, 100K: 0.973; 1M: 0.908) did not outperform the single-best algorithm, SVD++ (RMSE, 100K: 0.942; 1M: 0.887). We also develop a distinction between meta-learners that operate per instance (micro level), per data subset (mid level), and per dataset (global level). Our evaluation shows that a hypothetically perfect micro-level meta-learner would improve RMSE by 25.5% for the MovieLens 100K and 1M datasets, compared to the overall-best algorithms used.

Keywords: recommender system, evaluation, meta-learning, machine learning

1 Introduction

The 'algorithm selection problem' describes the challenge of finding the most effective algorithm for a given recommendation scenario. Typical recommendation scenarios include news websites [3], digital libraries [4, 5], and movie-streaming platforms [13]. The performance of recommender-system algorithms varies across these scenarios [3, 6, 10, 11, 15], as illustrated in Fig. 1. Performance variation occurs for many reasons; for example, the effectiveness of collaborative filtering algorithms changes depending on the number of ratings available from users [10]. Algorithms also perform differently depending on the demographic characteristics of users [6, 11], the time of day that recommendations are delivered, the number of requested recommendations, and many other factors [3]. No single algorithm is best in all scenarios.

This publication emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 13/RC/2106.

Precision          | tagesspiegel.de | tecchannel.de | motor-talk.de | ksta.de | sport1.de | cio.de
User-based CF      | 0.28 | 0.28 | 0.23 | 0.26 | 0.17 | 0.15
Mst. ppl. sequence | 0.23 | 0.32 | 0.31 | 0.15 | 0.10 | 0.16
Item-based CF      | 0.20 | 0.28 | 0.17 | 0.18 | 0.11 | 0.05
Content-based      | 0.07 | 0.15 | 0.09 | 0.13 | 0.07 | 0.31
Most popular       | 0.01 | 0.25 | 0.01 | 0.56 | 0.38 | 0.00

Fig. 1. The precision of five recommendation algorithms on six news platforms. On almost every platform, a different algorithm performs best [3].

To identify the most effective recommendation algorithm for a given use case, practitioners typically assess a pool of algorithms for suitability. These algorithms are trained using historical data, and their effectiveness is estimated using cross-validation techniques. The best candidate algorithm is typically chosen for a scenario on the basis of having the lowest predictive error or highest accuracy [16]. Single-number metrics such as precision, recall, nDCG, RMSE, and click-through rate are used.
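For illustration, this conventional, global-level selection process can be sketched as follows with the Surprise library (the same library used in our experiments later in the paper). The candidate list, the cross-validation settings, and the variable names are illustrative only, not a prescribed procedure.

```python
# A minimal sketch of conventional, "global-level" algorithm selection: every
# candidate is cross-validated on the whole dataset and the one with the lowest
# overall RMSE is chosen. The candidate list here is illustrative.
import numpy as np
from surprise import Dataset, SVD, SVDpp, KNNBasic
from surprise.model_selection import cross_validate

data = Dataset.load_builtin("ml-100k")          # MovieLens 100K
candidates = {"SVD": SVD(), "SVD++": SVDpp(), "KNN (Basic)": KNNBasic()}

overall_rmse = {}
for name, algo in candidates.items():
    cv = cross_validate(algo, data, measures=["rmse"], cv=5, verbose=False)
    overall_rmse[name] = np.mean(cv["test_rmse"])

# The single algorithm with the best overall metric is chosen for the whole
# platform, even if a different algorithm is better for individual users or items.
best = min(overall_rmse, key=overall_rmse.get)
print(f"Chosen overall-best algorithm: {best} (RMSE = {overall_rmse[best]:.3f})")
```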
We argue that focusing on the overall single-number performance of algorithms is not optimal. If we were able to accurately choose the most effective recommendation algorithm for a given user, item, and context, the overall effectiveness of the recommender system should improve.

Meta-learning for algorithm selection aims to predict the best algorithm to use in a given scenario. It does this by learning the relationship between characteristics of data and the performance of algorithms on that data [18]. It is useful in situations where distinct algorithms perform differently in varied scenarios. Meta-learners for algorithm selection can be trained and used at different levels of analysis. We develop here the following distinctions between these meta-learners:

1. Global-level meta-learners use the characteristics of data to select the overall-best algorithm for entire datasets, or an entire platform (e.g. a particular news website).
2. Mid-level meta-learners select the overall-best algorithm for subsets of data. For example, users in a recommendation scenario may differ by how many explicit ratings they have made; collaborative filtering algorithms are inappropriate for users who have made no ratings. Mid-level meta-learners may operate on any such entity (users, items, gender, age, etc.). They may also operate on arbitrary subsets of data (e.g. clusters selected via unsupervised learning).
3. Micro-level meta-learners attempt to select the best algorithm for every instance in a dataset, or every single recommendation request on a given platform.

In this paper, we propose an application of micro-level meta-learning to recommendation [2]. Most existing meta-learning approaches for recommendation act at a global level; they attempt to select the single-best algorithm for entire datasets. We attempt to select the best algorithm separately for each dataset instance.

Related Work

Meta-learning has been used for algorithm selection in recommender systems [1, 7–9, 12, 19]. These authors manually define meta-features, which aggregate information from datasets into single-number statistics. For example, the number of instances in a dataset is a 'simple' meta-feature, and the mean or kurtosis of a column is a 'statistical' meta-feature. They then use supervised machine learning to learn the relationships between the meta-features and the performance of recommendation algorithms on datasets, measured by standard metrics. Most authors attempt to select the single-best algorithm for entire datasets; they propose global-level meta-learners. This is beneficial in that it simplifies the process of choosing an algorithm for a recommendation scenario. However, it results in non-optimal recommender-system performance, as the best algorithm for each instance is not selected.

Ekstrand and Riedl [10] propose a mid-level meta-learner; they attempt to select the best algorithm for subsets of data in a dataset. They construct an ensemble of a small number of diverse algorithms and compare the ensemble's performance to baseline algorithms. Using a classifier and one meta-feature, an attempt is made to choose the better algorithm between item-item and user-user collaborative filtering for each user. Their meta-learning approach did not outperform the overall-best algorithm (RMSE; meta-learner: ~0.78, item-item CF: ~0.74).
Ensemble approaches exist that uniquely combine several algorithms for each instance, such as stacked generalization [22] and feature-weighted linear stacking [20]. These methods have produced good results [13, 17, 21]; however, they require that all algorithms in the pool are executed before their output is combined, or before a single algorithm is selected. For a large pool of algorithms, this may be a prohibitive requirement.

To the best of our knowledge, there are no applications of meta-learning for recommender systems that select algorithms at a more granular level than per user.

Methodology

Our meta-learner aims to use the best algorithm for each user-item pair in a recommendation dataset, chosen from a pool of individual recommendation algorithms. The pool of algorithms used in our system comprises nine collaborative filtering algorithms from the Surprise recommendation library (http://surpriselib.com/):

1. Co-clustering
2. KNN (Baseline)
3. KNN (Basic)
4. KNN (with Means)
5. Non-negative Matrix Factorization (NMF)
6. SVD
7. SVD++
8. Slope One
9. Baseline - a collaborative filtering baseline which always predicts the overall-average rating, biased by the overall-average rating for the user and the overall-average rating for the item

We performed our experiments using the MovieLens 100K and 1M datasets [14]. We chose these datasets over others because they contain demographic information for users and detailed item information, which may be useful in attempting to select a recommendation algorithm (Table 1).

To assess the potential improvement that a hypothetically 'perfect' micro-level meta-learner could offer over an overall-best algorithm, we first evaluate our pool of algorithms individually on the MovieLens 100K and 1M datasets. We randomly divide each dataset with a 70%:30% training:test split. We train each algorithm, and for each user-item pair in the test set we note the error between the predicted rating and the true rating. We calculate the overall RMSE for each algorithm. We also note which algorithm performed best for each user-item pair; a perfect micro-level meta-learner would be able to choose this algorithm. This process is illustrated in Table 1. We calculate the overall RMSE that a perfect micro-level meta-learner would achieve.

In a second evaluation, we test our micro-level meta-learner. We randomly divide each dataset into two equal subsets: a training set and an evaluation set. The training set is used to train our nine individual collaborative filtering recommendation algorithms, resulting in nine ready-to-use models. To evaluate the meta-learner, we perform a 5-fold cross-validation on the evaluation set. Each fold splits the evaluation set into two subsets: a meta-training set and a test set. The trained models of the individual recommendation algorithms are applied to the meta-training set, yielding the errors of the algorithms. The meta-learner is then trained on these errors, resulting in the meta-model. Finally, the meta-model is tested on the test set. We further assess a 'perfect' meta-learner using this evaluation set. We use RMSE to evaluate our approach.
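The first evaluation can be sketched as follows. This is a minimal illustration using the Surprise library and its built-in MovieLens 100K loader; the random seed and variable names are illustrative rather than the exact experimental code. It trains every algorithm in the pool on a 70%:30% split, records the per-instance rating errors, and computes each algorithm's overall RMSE as well as the RMSE a hypothetically perfect micro-level meta-learner would achieve by always choosing the algorithm with the smallest absolute error for each user-item pair.

```python
# Minimal sketch of the first evaluation: per-algorithm RMSE and the RMSE of a
# hypothetical "perfect" micro-level meta-learner on a 70%/30% split.
import numpy as np
from surprise import (Dataset, SVD, SVDpp, NMF, SlopeOne, CoClustering,
                      KNNBasic, KNNBaseline, KNNWithMeans, BaselineOnly)
from surprise.model_selection import train_test_split

data = Dataset.load_builtin("ml-100k")                  # MovieLens 100K
trainset, testset = train_test_split(data, test_size=0.3, random_state=0)

pool = {
    "Co-clustering": CoClustering(), "KNN (Baseline)": KNNBaseline(),
    "KNN (Basic)": KNNBasic(), "KNN (with Means)": KNNWithMeans(),
    "NMF": NMF(), "SVD": SVD(), "SVD++": SVDpp(),
    "Slope One": SlopeOne(), "Baseline": BaselineOnly(),
}

errors = {}                                             # name -> per-instance errors
for name, algo in pool.items():
    algo.fit(trainset)
    predictions = algo.test(testset)                    # one prediction per user-item pair
    errors[name] = np.array([p.est - p.r_ui for p in predictions])
    print(f"{name}: RMSE = {np.sqrt(np.mean(errors[name] ** 2)):.3f}")

# A perfect micro-level meta-learner picks, for every row, the algorithm with the
# smallest absolute error; its RMSE is a lower bound for any micro-level meta-learner.
error_matrix = np.column_stack(list(errors.values()))
oracle_rmse = np.sqrt(np.mean(np.min(np.abs(error_matrix), axis=1) ** 2))
print(f"Perfect micro-level meta-learner: RMSE = {oracle_rmse:.3f}")
```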
Table 1: An illustration of the MovieLens dataset. The columns cover user attributes, movie attributes, and the predicted-rating errors of three algorithms (SVD, Slope One, KNN (Basic)). The algorithm with the best predicted error rate per row is the one a hypothetical perfect meta-learner should predict.

#  | User ID | Age | Male | Female | artist | doctor | educator | … | Movie ID | 60s | 70s | 80s | … | Action | Comedy | Crime | … | Rating | SVD | Slope One | KNN (Basic) | …
1  | 506 | 46 | 1 | 0 | 1 | 0 | 0 | … | 568  | 0 | 0 | 0 | … | 1 | 0 | 0 | … | 5 | -1.06 | -1.08 | -1.12 | …
2  | 363 | 20 | 1 | 0 | 1 | 0 | 0 | … | 849  | 0 | 0 | 0 | … | 1 | 0 | 0 | … | 2 |  0.26 |  0.23 |  0.66 | …
3  | 842 | 40 | 1 | 0 | 0 | 1 | 0 | … | 874  | 0 | 0 | 0 | … | 0 | 0 | 0 | … | 5 | -1.94 | -2.09 | -2.20 | …
4  | 312 | 48 | 1 | 0 | 0 | 0 | 0 | … | 241  | 0 | 0 | 0 | … | 1 | 0 | 0 | … | 3 |  1.09 |  0.87 |  0.89 | …
5  | 42  | 30 | 1 | 0 | 0 | 0 | 1 | … | 294  | 0 | 0 | 0 | … | 0 | 1 | 0 | … | 4 | -0.68 | -0.82 | -0.51 | …
6  | 812 | 22 | 1 | 0 | 0 | 0 | 0 | … | 326  | 0 | 0 | 0 | … | 1 | 0 | 0 | … | 4 | -0.72 | -0.33 | -0.51 | …
7  | 450 | 35 | 0 | 1 | 0 | 0 | 1 | … | 3    | 0 | 0 | 0 | … | 0 | 0 | 0 | … | 4 | -0.37 | -0.54 | -0.87 | …
8  | 90  | 60 | 1 | 0 | 0 | 0 | 1 | … | 42   | 0 | 0 | 0 | … | 0 | 1 | 0 | … | 4 |  0.17 |  0.10 | -0.03 | …
9  | 36  | 19 | 0 | 1 | 0 | 0 | 0 | … | 882  | 0 | 0 | 0 | … | 0 | 0 | 0 | … | 5 | -1.24 | -0.71 | -1.43 | …
10 | 551 | 25 | 1 | 0 | 0 | 0 | 0 | … | 235  | 0 | 0 | 0 | … | 1 | 1 | 0 | … | 1 |  1.97 |  2.16 |  1.79 | …
11 | 553 | 58 | 1 | 0 | 0 | 0 | 1 | … | 1126 | 0 | 0 | 0 | … | 0 | 0 | 0 | … | 4 | -0.53 | -0.61 | -0.95 | …
12 | 168 | 48 | 1 | 0 | 0 | 0 | 0 | … | 1028 | 0 | 0 | 0 | … | 0 | 1 | 0 | … | 2 |  0.80 |  0.88 |  1.34 | …

During training, we perform the following steps (Fig. 2):

1. Single algorithms are trained on the training set. The result is a single model for each collaborative filtering algorithm. This model can predict a rating for a given user-item pair.
2. Each trained single algorithm is applied to every row in the meta-learner training set. This gives us a rating prediction from each algorithm for each of these rows.
3. For each single algorithm and each row in the meta-learner training set, a rating-prediction error is calculated. This error is the difference between the predicted rating and the true rating. These rating-prediction errors are illustrated in Table 1.
4. Using the meta-learner training set, we train a linear model for each single algorithm. These linear models are trained on the content features of the rows, for example the gender, age, and occupation of the user, and the genre and year of the movie. We also include 10 meta-features: the rating mean, standard deviation, minimum, maximum, and median for each user and each item. Categorical features are one-hot encoded. The training target is the rating-prediction error from Step 3. These models allow us to predict each algorithm's error, based on the content features and meta-features of the user-item pair.

In the prediction stage of our meta-learner, the rating for a given user-item pair is calculated in the following steps:

1. For each algorithm in the pool, we predict the error that the algorithm will make for this user-item pair.
2. We rank the algorithms according to the absolute value of their predicted errors and choose the algorithm with the lowest predicted error.
3. The chosen algorithm is applied to the user-item pair. The rating predicted by the chosen algorithm is the final output of our system.

We compare our meta-learner to two baselines. The first is the best single algorithm for the dataset. The second is a simple ensemble, which averages the ratings predicted by all single algorithms for each row.

Fig. 2. A block diagram of our Meta-learned Recommender System.
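The training and prediction stages above can be sketched as follows. The sketch assumes scikit-learn's LinearRegression for the per-algorithm error models, and that the feature matrix (one-hot-encoded content features plus the ten rating meta-features) has already been assembled; names such as trained_pool, X_meta, and meta_ratings are illustrative placeholders, not the exact implementation.

```python
# Minimal sketch of the meta-learner: one linear model per algorithm predicts that
# algorithm's rating error; at prediction time the algorithm with the smallest
# predicted absolute error is chosen. Assumed inputs:
#   trained_pool : dict name -> fitted Surprise algorithm (from the 50% training set)
#   X_meta       : ndarray feature matrix of the meta-training rows
#   meta_ratings : list of (user_id, item_id, true_rating) for the same rows
import numpy as np
from sklearn.linear_model import LinearRegression

def rating_meta_features(ratings, key):
    """Five of the ten meta-features: rating mean, std, min, max and median per user
    (key='user_id') or per item (key='item_id'); `ratings` is a pandas DataFrame."""
    return ratings.groupby(key)["rating"].agg(["mean", "std", "min", "max", "median"])

def fit_error_models(trained_pool, X_meta, meta_ratings):
    """Train one linear rating-error model per algorithm on the meta-training set."""
    error_models = {}
    for name, algo in trained_pool.items():
        # Step 3: rating-prediction error of this algorithm on every meta-training row.
        errors = np.array([algo.predict(u, i).est - r for u, i, r in meta_ratings])
        # Step 4: a linear model mapping row features to that algorithm's error.
        error_models[name] = LinearRegression().fit(X_meta, errors)
    return error_models

def predict_rating(error_models, trained_pool, x_row, user_id, item_id):
    """Prediction stage: choose the algorithm with the lowest predicted absolute
    error for this user-item pair and return its predicted rating."""
    predicted_errors = {name: abs(model.predict(x_row.reshape(1, -1))[0])
                        for name, model in error_models.items()}
    chosen = min(predicted_errors, key=predicted_errors.get)
    return trained_pool[chosen].predict(user_id, item_id).est, chosen
```

Because each error model is fitted independently, adding a new algorithm to the pool only requires fitting one additional model, without retraining the rest of the system.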
Results & Discussion

The results from our first evaluation of a hypothetical 'perfect' micro-level meta-learner are shown in Fig. 3 and Fig. 4.

Fig. 3. The average error (RMSE) of different collaborative filtering algorithms on the MovieLens 100K and 1M datasets. The RMSE for a 'perfect' micro-level meta-learner, in which the best algorithm is chosen for each user-item pair, is shown.

For both MovieLens 100K and 1M, the algorithm with the lowest RMSE from our collection of collaborative filtering algorithms is SVD++ (RMSE; ML100K: 0.926, ML1M: 0.876), followed by a variant of k-nearest neighbors (KNN (Baseline), RMSE; ML100K: 0.934) and SVD (ML1M: 0.892) (Fig. 3).

An evaluation of MovieLens test sets using these algorithms would suggest to an operator that SVD++ and KNN (Baseline) are the best candidate algorithms to use. However, for the rows of the 100K dataset, SVD++ is not the best algorithm most often (ML100K; SVD++: 15.85% vs. KNN (Basic): 16.7%) (Fig. 4). The second-best algorithm by RMSE (KNN (Baseline)) is least often the best algorithm per user-item pair in the 100K dataset. In the 1M dataset, the second most-frequently best algorithm, KNN (Basic) (16.10%), is the least accurate with regard to RMSE (0.936).

Using the overall-best algorithms for these datasets is therefore a significant compromise. In a hypothetical scenario in which the best algorithm per dataset instance could be chosen, i.e. with a perfect meta-learner, RMSE would be improved by 25.5% for both 100K and 1M compared to their respective overall-best algorithms (Fig. 3).

Fig. 4. Percentage of times each collaborative filtering algorithm was best on the MovieLens 100K and 1M datasets. The algorithms least often best are highlighted (ML100K: KNN (Baseline); ML1M: Baseline).

The results of our second evaluation are presented in Fig. 5. Our meta-learner (RMSE, 100K: 0.973; 1M: 0.908) performed 2-3% worse than the best individual algorithm, SVD++ (RMSE, 100K: 0.942; 1M: 0.887), and the average-rating baseline (RMSE, 100K: 0.943; 1M: 0.893). These results suggest that the current implementation of our meta-learner is unable to accurately rank algorithms according to their rating errors.

Fig. 5. Effectiveness of the meta-learned recommender system, compared to 1) the overall-best algorithm, 2) an ensemble which predicts the average rating of all algorithms per row, and 3) the theoretical best case in which the actual best algorithm per user-item pair is chosen.

The nine algorithms used are of a similar class. We expect that a more heterogeneous pool of algorithms may provide better results. It is also possible that, because the algorithms we used have been trained on the same dataset that our meta-learner has been trained on, they are all already well fitted to the data. In such a case, the differences between error predictions and real errors might be too small to allow for accurate rankings of the top algorithms: our predictions for the worst algorithm per row are twice as good as our predictions for the best (accuracy; 0.12 vs. 0.21). Linear regression may also not be suitable to model rating-prediction errors. More advanced algorithms may be more suitable for meta-learning.

Our approach is computationally inexpensive compared to standard ensembles. As the final prediction is calculated by one chosen algorithm, we do not need to obtain and weight predictions from all algorithms in a pool to make our final prediction. We also do not need to use all algorithms to retrain our system when a new algorithm is introduced, as the rating-prediction error models are independent of each other. For these reasons, in future work we hope to improve our rating-prediction error models.

References

[1] Ahsan, M. and Ngo-Ye, L. 2005. A Conceptual Model of Recommender System for Algorithm Selection. AMCIS 2005 Proceedings. (2005), 122.
[2] Beel, J. 2017. A Macro/Micro Recommender System for Recommendation Algorithms [Proposal].
[3] Beel, J., Breitinger, C., Langer, S., Lommatzsch, A. and Gipp, B. 2016. Towards reproducibility in recommender-systems research. User Modeling and User-Adapted Interaction. 26, 1 (2016), 69–101.
[4] Beel, J., Gipp, B., Langer, S. and Breitinger, C. 2016. Research-paper recommender systems: a literature survey. International Journal on Digital Libraries. 17, 4 (2016), 305–338.
[5] Beel, J., Gipp, B., Langer, S., Genzmehr, M., Wilde, E., Nürnberger, A. and Pitman, J. 2011. Introducing Mr. DLib, a Machine-readable Digital Library. Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (2011), 463–464.
[6] Beel, J., Langer, S., Nürnberger, A. and Genzmehr, M. 2013. The Impact of Demographics (Age and Gender) and Other User-Characteristics on Evaluating Recommender Systems. Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Proceedings (2013), 396–400.
[7] Cunha, T., Soares, C. and Carvalho, A.C. de 2018. CF4CF: Recommending Collaborative Filtering algorithms using Collaborative Filtering. arXiv preprint arXiv:1803.02250 (2018).
[8] Cunha, T., Soares, C. and Carvalho, A.C. de 2018. Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences. 423 (2018), 128–144.
[9] Cunha, T., Soares, C. and Carvalho, A.C. de 2016. Selecting Collaborative Filtering algorithms using Metalearning. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2016), 393–409.
[10] Ekstrand, M. and Riedl, J. 2012. When recommenders fail: predicting recommender failure for algorithm selection and combination. Proceedings of the Sixth ACM Conference on Recommender Systems (2012), 233–236.
[11] Ekstrand, M.D. and Pera, M.S. 2017. The Demographics of Cool: Popularity and Recommender Performance for Different Groups of Users. Proceedings of the Poster Track of the 11th ACM Conference on Recommender Systems (RecSys 2017), Como, Italy, August 28, 2017.
[12] García-Saiz, D. and Zorrilla, M. Metalearning-based recommenders: towards automatic classification algorithm selection.
[13] Gomez-Uribe, C.A. and Hunt, N. 2015. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Transactions on Management Information Systems. 6, 4 (2015), 13:1–13:19.
[14] Harper, F.M. and Konstan, J.A. 2016. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS). 5, 4 (2016), 19.
[15] Im, I. and Hars, A. 2007. Does a one-size recommendation system fit all? The effectiveness of collaborative filtering based recommendation systems across different domains and search modes. ACM Transactions on Information Systems. 26, 1 (2007), 4.
[16] Kalousis, A. 2002. Algorithm selection via meta-learning. University of Geneva.
[17] Koren, Y. 2009. The BellKor solution to the Netflix Grand Prize. Netflix Prize documentation. 81 (2009), 1–10.
[18] Lemke, C., Budka, M. and Gabrys, B. 2015. Metalearning: a survey of trends and technologies. Artificial Intelligence Review. 44, 1 (2015), 117–130.
[19] Romero, C., Olmo, J.L. and Ventura, S. 2013. A meta-learning approach for recommending a subset of white-box classification algorithms for Moodle datasets. Educational Data Mining 2013 (2013).
[20] Sill, J., Takács, G., Mackey, L. and Lin, D. 2009. Feature-weighted linear stacking. arXiv preprint arXiv:0911.0460 (2009).
[21] Töscher, A., Jahrer, M. and Bell, R.M. 2009. The BigChaos solution to the Netflix Grand Prize. Netflix Prize documentation. (2009), 1–52.
[22] Wolpert, D.H. 1992. Stacked generalization. Neural Networks. 5, 2 (1992), 241–259.