Greedy Ensemble Selection for Top-N Recommendations

Tobias Vente¹, Zainil Mehta¹, Lukas Wegmeth¹ and Joeran Beel¹
¹ Intelligent Systems Group, University of Siegen, Germany
Contact: tobias.vente@uni-siegen.de (T. Vente); zainil.mehta@student.uni-siegen.de (Z. Mehta); lukas.wegmeth@uni-siegen.de (L. Wegmeth); joeran.beel@uni-siegen.de (J. Beel)
ORCID: 0009-0003-8881-2379 (T. Vente); 0009-0002-0556-9493 (Z. Mehta); 0000-0001-8848-9434 (L. Wegmeth); 0000-0002-4537-5573 (J. Beel)

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October 2024, Bari, Italy.
© Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Despite the pivotal role ensembling played in the success of BellKor's Pragmatic Chaos Team in winning the Netflix Prize challenge in the early 2000s, it never became a standard practice in recommender systems. In contrast, ensembling, particularly greedy ensemble selection, has become a standard practice in machine learning to enhance performance compared to a single model. Despite the success of greedy ensemble selection in classification and regression tasks, it has not been adapted for top-n prediction tasks. Hence, in this study, we aim to analyze the potential of greedy ensemble selection to boost the performance of recommender system models for top-n prediction tasks. We adapt the concept of greedy ensemble selection for top-n prediction tasks, train and optimize ten factorization- and neighborhood-based models on five datasets, and compare the performance of the ensemble to that of the individual models. Our experiments reveal that greedy ensemble selection always performs better than a single model and enhances performance by an average of 8.8% on NDCG@5, 8.6% on NDCG@10, and 16.3% on NDCG@20 compared to the single best model.

Keywords
Ensembling, Recommender Systems, Algorithm Selection, Automatic Algorithm Selection

1. Introduction

Ensembling played a pivotal role for BellKor's Pragmatic Chaos Team, enhancing their recommender system to win the Netflix Prize challenge in the early 2000s [1, 2]. Despite its success in the competition, ensembling did not become a standard practice in the field of recommender systems. Today, mainly hybrid recommender systems rely on ensembling, either trained on different data or used to aggregate predictions [3, 4], often combining collaborative filtering with content-based models to cancel out the weaknesses of individual models, such as the cold-start problem [5].

In comparison, in machine learning, ensembling, particularly greedy ensemble selection, is a standard practice [6], enhancing performance by as much as 37% in best-case scenarios and improving robustness [7]. Moreover, in automated machine learning, ensembling enhances performance to such an extent that some tools prioritize ensembling over further hyperparameter optimization [8].

However, despite the success of greedy ensemble selection for regression and classification, it has never been adapted for top-n ranking prediction tasks [6]. While ensembling has proven effective in machine learning, it remains a largely overlooked approach in the field of recommender systems. In recommender systems, researchers continue to debate whether the field makes progress, yet the focus primarily remains on continuously developing more sophisticated algorithms [9]. Instead of implementing a new, more complex recommender system algorithm, we focus on ensembling already existing algorithms.

Therefore, we analyze the potential of greedy ensemble selection for top-n prediction tasks and answer the question: RQ: How does the ensembling of factorization- and neighborhood-based models impact performance and robustness compared to a single optimized model?

In this work, we adapt greedy ensemble selection for top-n prediction tasks to assess its potential for enhancing ranking performance. We focus on ten fast and easy-to-train factorization- and neighborhood-based models. We then evaluate the ensemble output of these ten models on five datasets using NDCG@k, with k set to 5, 10, and 20, aiming to quantify the performance and robustness improvements compared to single optimized models.

Our contribution is the implementation of greedy ensemble selection for top-n ranking prediction tasks, along with a comprehensive analysis of its performance impact and robustness compared to single optimized models. Our results indicate that greedy ensemble selection improves performance by an average of 8.8% on NDCG@5, 8.6% on NDCG@10, and 16.3% on NDCG@20 compared to the single best model on five datasets. Additionally, while no single model performs best across all datasets, greedy ensemble selection consistently performs best, making it the most robust recommender with regard to performance.

The implementation of greedy ensemble selection, along with the code and necessary documentation to reproduce all experiments, is publicly available in our GitHub repository¹.

¹ https://github.com/ISG-Siegen/greedy-ensemble-selection-for-top-n-recommendations
2. Related Work

The use of ensembling techniques in recommender systems is not new and has been covered in the literature [4, 3, 5, 10]. Today, primarily hybrid recommender systems use ensembling to mitigate the weaknesses of individual algorithms [3, 5, 4]. However, hybrid recommender systems require knowledge of the strengths and weaknesses of the different algorithms to ensemble them effectively. In contrast to hybrid recommender systems, our work focuses on ensembling recommender system algorithms without manually selecting complementary algorithm combinations.

As in our work, researchers have applied ensembling techniques that do not require manual model selection. For example, standard machine learning ensemble techniques, such as bagging and boosting, have been applied to recommender systems [11, 10], allowing the ensembling of a diverse set of models without manual model selection. However, this work mostly focuses on rating prediction tasks. Furthermore, bagging and boosting require the modification of training data. We focus on post-hoc ensembling, taking only model predictions into account.

Additionally, researchers have analyzed ensembling for various other aspects of recommender systems. Researchers ensemble predictions of models trained on different datasets containing different user feedback types [12, 13], ensemble models optimized for different objectives [14, 15], or implement ensembling techniques specifically designed and tested for certain domains, applications, or with a limited number of base models [16, 14, 17]. Others optimize ensembling techniques for specific datasets to showcase the capabilities of ensembling without focusing on generalization [18]. However, these approaches have limitations: they often only work with models from the same algorithm, require multiple data inputs, necessitate optimization for multiple objectives, or focus on ensembling for rating predictions.

Recent work has focused on greedy ensemble selection for recommender systems [19, 20]. In this work, the authors applied greedy ensemble selection to rating prediction tasks by treating them as regression problems, thus utilizing the standard greedy ensemble selection approach for regression. However, this method does not offer solutions for top-n ranking predictions, leaving a gap in the application of greedy ensemble selection in recommender systems.
3. Greedy Ensemble Selection

Greedy ensemble selection, as implemented in machine learning, cannot be directly applied to top-n recommendations in recommender systems. Greedy ensemble selection for classification applies majority voting on predictions. However, majority voting fails for top-n recommendations since the number of repeating item recommendations across users is often insufficient. Similarly, taking the mean, as in regression tasks, is not applicable, as top-n recommendations deal with ranked lists instead of the single numeric values returned by each model. Therefore, we focus on aggregating and re-ranking the prediction scores of multiple models to generate an ensemble of their outputs.

To apply greedy ensemble selection, we assume we have a set of trained and optimized models P, each predicting k' items, together with their validation performance. We aim to aggregate the k' predictions of every model p_n ∈ P into one ranked list of length k. This requires the length k' of every predicted list p_n to be at least as long as k.

In ensembling, we can utilize more predictions (k') than the desired output list length (k). There is a chance that a prediction ranked at position k + 1 or beyond still holds relevance or contributes valuable information, even though it does not make it into the top k predictions of the model. Taking k' predictions into account enables the ensemble process to utilize a broader range of data, potentially improving the performance of the final ensembled recommendations. Furthermore, utilizing k' predictions does not increase the prediction cost of the base models, since all models score all candidate items anyway before selecting the top k.

We normalize all k' prediction scores and multiply each prediction score by the validation performance of the respective model p_n (Algorithm 1). The normalization ensures that all models have an equal impact on the ensembling, while the multiplication by validation performance weights the impact based on the models' performance. Consequently, the impact of models with good validation performance is increased relative to that of poorly performing models.

Then, we initiate the greedy search by examining all subsets of models p_n ∈ P and, for each subset, aggregating the k' predictions by summing and re-ranking the prediction scores of all models included in the subset (Algorithm 1). If an item appears in the k' lists of multiple models, its scores are summed to reflect its collective relevance across models. We then select the top-k predictions of the ensembled list. The performance of every subset of models p_n ∈ P is evaluated on the validation set. The best-performing ensemble of models is then selected as the final result of the greedy ensemble selection, and its top-k predictions are returned.
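Algorithm 1 is not reproduced in this extracted text. The following is a minimal Python sketch of the aggregation and subset-search logic described above, written for a single user's recommendation lists. The function names, the per-model dictionaries, the min-max normalization, and the plain enumeration of all model subsets are our illustrative assumptions, not the authors' reference implementation, which is available in the linked repository.

```python
from itertools import combinations

def ensemble_scores(predictions, val_perf, subset, k):
    """Aggregate and re-rank the top-k' scores of the models in `subset`.

    predictions: {model_name: {item_id: score}}  # each model's top-k' items
    val_perf:    {model_name: float}             # validation NDCG used as weight
    """
    combined = {}
    for model in subset:
        scores = predictions[model]
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for item, score in scores.items():
            # Normalize per model, then weight by validation performance.
            weighted = (score - lo) / span * val_perf[model]
            # Items recommended by several models accumulate their scores.
            combined[item] = combined.get(item, 0.0) + weighted
    return sorted(combined, key=combined.get, reverse=True)[:k]

def greedy_ensemble_selection(predictions, val_perf, k, evaluate):
    """Return the model subset whose aggregated top-k list scores best on validation.

    evaluate: maps a ranked item list to a validation metric such as NDCG@k.
    """
    models = list(predictions)
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(models) + 1):
        for subset in combinations(models, size):
            score = evaluate(ensemble_scores(predictions, val_perf, subset, k))
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, ensemble_scores(predictions, val_perf, best_subset, k)
```

With ten base models, this enumeration covers 2^10 - 1 = 1,023 candidate subsets per configuration, which matches the exhaustive search the authors mention in the discussion; an incremental greedy loop that adds one model at a time would be the cheaper textbook variant.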
4. Experiments

We conducted all of our experiments with ten different factorization- and neighborhood-based algorithms and greedy ensemble selection on five datasets. The hardware includes AMD EPYC 7452 CPU processors, each with 32 cores and a CPU frequency ranging from 2.35 to 3.35 GHz.

4.1. Experimental Pipeline

In our experimental pipeline, we apply five-fold cross-validation to all five datasets, randomly splitting each fold into three sets: 60% for training, 20% for validation, and 20% for testing. With the training and validation sets, we optimize all included algorithms with two hours of random search to select the best hyperparameter configuration. With the test set, we evaluate the final performance of the single models as well as the greedy ensemble selection. This combination of five-fold cross-validation and random search allows each algorithm to be finely tuned on every subset of the data while mitigating the effects of randomness in data splits and hyperparameter selection [21].

We measure performance using NDCG@k for k = 5, 10, 20 to evaluate top-n ranking predictions for different list lengths. The NDCG@k model performance on the validation set, obtained from the random search optimization process, is later used to weight the model predictions (Section 3).
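The paper does not spell out the metric; for reference, the standard definition of NDCG@k with binary relevance, the usual form for implicit-feedback top-n evaluation, is:

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where rel_i ∈ {0, 1} indicates whether the item at rank i is a held-out interaction of the user, and IDCG@k is the DCG@k of the ideal ranking that places all held-out items first. Per-user scores are averaged over all users.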
4.1.1. Datasets

We include five distinct datasets of different sizes in our experiments and refer to Table 1 for a detailed overview. We convert datasets with user ratings, specifically MovieLens-1M, MovieLens-100k, and CiaoDVD, into binary user feedback datasets, as is done in related work [4, 3, 5]. Furthermore, we prune all datasets such that all included users and items have at least five interactions, commonly known as five-core pruning [22, 23, 24]. Table 1 shows all included datasets' statistics after preprocessing.

Table 1
Dataset statistics after five-core pruning and user feedback transformation, split between the implicit (first part) and explicit (second part) feedback datasets.

Name                 Interactions   Users   Items    Sparsity
Citeulike-a [25]     200,180        5,536   15,429   99.77%
Hetrec-Lastfm [26]   71,355         1,859   2,823    98.64%
CiaoDVD²             23,467         1,582   1,788    99.17%
MovieLens-1M [27]    835,789        6,038   3,307    95.81%
MovieLens-100k [27]  81,697         943     1,203    92.8%
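A minimal pandas sketch of the preprocessing described in Section 4.1.1; the column names and the choice to keep every observed rating as a positive interaction (rather than thresholding the rating value) are assumptions on our part, since the paper does not specify these details.

```python
import pandas as pd

def binarize_feedback(df: pd.DataFrame) -> pd.DataFrame:
    """Turn explicit ratings into implicit feedback: every observed
    (user, item) pair becomes a single positive interaction."""
    return df[["user", "item"]].drop_duplicates()

def five_core_prune(df: pd.DataFrame, min_interactions: int = 5) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than `min_interactions`
    until both constraints hold at the same time."""
    while True:
        user_ok = df["user"].map(df["user"].value_counts()) >= min_interactions
        item_ok = df["item"].map(df["item"].value_counts()) >= min_interactions
        keep = user_ok & item_ok
        if keep.all():
            return df
        df = df[keep]
```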
4.1.2. Algorithms

We include ten factorization- and neighborhood-based recommender systems algorithms in our experiments. The algorithm implementations are from the Implicit [28] and LensKit [29] recommender systems libraries. The algorithms from Implicit are Alternating Least Squares (ALS), Logistic Matrix Factorization (LogisticMF), Bayesian Personalized Ranking (BPR), and Item-Item Nearest Neighbors with the distance metrics Cosine Similarity, TF-IDF, and BM25. The algorithms from LensKit are Implicit Matrix Factorization (ImplicitMF), User-User Nearest Neighbors (UserKNN), Item-Item Nearest Neighbors (ItemKNN), and Most Popular.

4.2. Greedy Ensemble Selection

We run greedy ensemble selection using various prediction input list lengths (k') to examine the impact of predictions ranked beyond position k on the ensembling process (Section 3). We set k' to 5, 10, 15, 25, 50, 75, 100, 125, and 150. This wide range of k' values helps us identify trends in the impact of longer input list lengths. All ensemble configurations are evaluated on the validation set. Ultimately, we select the ensemble configuration, with its optimal k' value, that performs best on average across all folds.

[Figure 1 (line chart): "Average Ensembling Performance on Various k' Values"; series: Greedy Ensemble Selection vs. Single Best Algorithm; x-axis: k' (top-n recommendations), y-axis: relative NDCG@10 performance.]

Figure 1: Performance differences of greedy ensemble selection with varying input prediction list lengths (k') compared to the virtual single best algorithm, averaged over five datasets. The x-axis represents the input prediction list lengths (k'), and the y-axis shows the relative NDCG@10 performance. The shaded band represents the 95% confidence interval for NDCG@10.
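A sketch of the k' sweep described above, reusing the hypothetical greedy_ensemble_selection helper from the Section 3 sketch. It is simplified to a single fold, and truncating each model's full ranked list to the current k' before ensembling is an assumption about how the sweep is wired up; the paper selects the configuration by the average validation score across all five folds.

```python
K_PRIME_GRID = [5, 10, 15, 25, 50, 75, 100, 125, 150]

def sweep_k_prime(ranked_predictions, val_perf, k, evaluate):
    """ranked_predictions: {model_name: [(item_id, score), ...]} sorted by score."""
    best = None  # (validation score, k', model subset)
    for k_prime in K_PRIME_GRID:
        if k_prime < k:
            continue  # k' must be at least as long as the output list
        truncated = {m: dict(items[:k_prime])
                     for m, items in ranked_predictions.items()}
        subset, top_k = greedy_ensemble_selection(truncated, val_perf, k, evaluate)
        score = evaluate(top_k)
        if best is None or score > best[0]:
            best = (score, k_prime, subset)
    return best
```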
5. Results

Our experiments reveal that greedy ensemble selection enhances performance by an average of 8.8% on NDCG@5, 8.6% on NDCG@10 (Table 2), and 16.3% on NDCG@20 compared to the single best model. Since NDCG@10 is the most commonly used evaluation metric with a cutoff of k = 10 and the trends are consistent across all k values, our analysis focuses on the NDCG@10 results.

Table 2
NDCG@10 performance of ten factorization- and neighborhood-based models, along with greedy ensemble selection, across five datasets. The best result among the individual models per dataset is marked with an asterisk; the overall best result per dataset is achieved by Greedy Ensemble in every case. The relative performance increase is calculated based on the performance of Popularity.

Algorithms        CiaoDVD   CiteULike-A   Hetrec-LastFM   MovieLens-1M   MovieLens-100k   Rel. Performance Increase
ALS               0.02      0.066         0.147           0.234          0.232*           84%
BPR               0.013     0.027         0.082           0.119          0.173            9%
ImplicitMF        0.022     0.109         0.159           0.189          0.184            75%
ItemItem-BM25     0.024     0.106         0.168           0.234          0.221            99%
ItemItem-Cosine   0.01      0.082         0.169           0.217          0.21             82%
ItemItem-TFIDF    0.016     0.094         0.168           0.225          0.218            90%
ItemKNN           0.013     0.096         0.175*          0.216          0.212            88%
LogisticMF        0.018     0.058         0.135           0.161          0.191            49%
UserKNN           0.026*    0.112*        0.157           0.235*         0.225            99%
Popularity        0.016     0.009         0.07            0.142          0.142            0%
Greedy Ensemble   0.03      0.117         0.183           0.247          0.243            108%

In general, the algorithm performance ranking varies across datasets. While Popularity always yields the lowest NDCG@10 score (Table 2), the best-performing algorithm changes: UserKNN performs best on CiaoDVD, CiteULike-A, and MovieLens-1M, ItemKNN performs best on Hetrec-LastFM, and ALS performs best on MovieLens-100k.

In contrast to the single-algorithm performances, greedy ensemble selection consistently outperforms all algorithms across all datasets for all NDCG@k values and averts the algorithm selection problem. Greedy ensemble selection effectively identifies and aggregates a subset of models that outperforms the single best model, resulting in the highest NDCG@k scores on all five datasets. This approach reliably enhances performance and delivers robust results compared to the respective single best model.

On average, the overall performance advantage of greedy ensemble selection compared to the single best algorithm is 8.6% on NDCG@10 (108% vs. 99% for UserKNN, Table 2), but it varies across datasets. Greedy ensemble selection shows a performance increase as high as 15.4% on CiaoDVD (0.03 vs. 0.026 for UserKNN, Table 2) and as low as 0.7% on Hetrec-LastFM (0.175 vs. 0.174 for ItemKNN). On datasets like CiteULike-A, MovieLens-1M, and MovieLens-100k, the performance increase is approximately 5%.

Longer prediction input lists of length k' (Section 3) improve the overall ensemble performance (Fig. 1). Predictions that do not make it into the final k predictions of the single models still contribute valuable information to the ensemble process. While ensembling k' = k predictions already enhances performance, increasing k' can further improve results. We tested using up to 150 predictions per user from each model and observed that the ensemble's performance plateaued beyond k' = 100 predictions (Fig. 1). Additionally, increasing k' beyond this point incurs higher computational costs during the ensembling process without yielding significant performance gains.

6. Discussion

To comprehensively answer our research question (how does the ensembling of factorization- and neighborhood-based models impact performance compared to a single optimized model?), we conclude that greedy ensemble selection of factorization- and neighborhood-based models enhances performance, on average, by up to 16.3% compared to the single best model, averaged over all datasets.

Our experiments show that greedy ensemble selection enhances performance and avoids the need for manual algorithm selection. By ensembling a subset of all available algorithms, greedy ensemble selection consistently achieves better results than any single algorithm across all included datasets. However, ensembling introduces an additional step in the recommender systems pipeline.

Despite its performance boost, greedy ensemble selection for top-n recommendations is expensive compared to single factorization- and neighborhood-based models. In addition to adding complexity to the pipeline, ensembling requires the training and optimization of multiple models to utilize their predictions for the top-n recommendations, increasing the overall complexity and computational cost.

Nevertheless, the research community appears willing to accept higher computational costs for better performance, as evidenced by ever more sophisticated algorithms and the growing use of deep-learning approaches. While greedy ensemble selection involves an exhaustive search, further research could optimize the ensembling process for greater efficiency.

Currently, there is an ongoing debate in the field about whether recommender systems are truly making progress [9]. Much of the current research focuses on developing new (deep-learning) approaches, which do not necessarily outperform well-optimized traditional models. Revisiting and adapting ensembling for top-n recommendations, particularly with easy-to-train traditional recommender system algorithms, could open a new research direction. By adapting and improving advanced ensembling methods, recommender systems could significantly enhance their performance, especially for top-n predictions.

6.1. Future Work

Future work can investigate the contribution of more advanced deep-learning algorithms to the ensembling process. This includes assessing those models' potential performance enhancements and comparing the overall performance to state-of-the-art deep-learning methods. Furthermore, analyzing more efficient strategies to build an effective ensemble is valuable. Finally, examining the impact of ensembling across different domains within recommender systems helps to better understand domain-specific trends.
References

[1] A. Toscher, M. Jahrer, R. M. Bell, The BigChaos solution to the Netflix Grand Prize.
[2] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259. doi:10.1016/S0893-6080(05)80023-1.
[3] E. Γ‡ano, M. Morisio, Hybrid recommender systems: A systematic literature review, Intell. Data Anal. 21 (2017) 1487–1524. URL: https://doi.org/10.3233/IDA-163209. doi:10.3233/IDA-163209.
[4] R. Burke, Hybrid Systems for Personalized Recommendations, volume 3169 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 133–152. URL: http://link.springer.com/10.1007/11577935_7. doi:10.1007/11577935_7.
[5] R. Burke, Hybrid recommender systems: Survey and experiments, User Modeling and User-Adapted Interaction 12 (2002) 331–370. doi:10.1023/A:1021240730564.
[6] P. Gijsbers, M. L. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl, J. Vanschoren, AMLB: an AutoML benchmark, Journal of Machine Learning Research 25 (2024) 1–65.
[7] J. Heinermann, O. Kramer, Machine learning ensembles for wind power prediction, Renewable Energy 89 (2016) 671–679.
[8] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, A. Smola, AutoGluon-Tabular: Robust and accurate AutoML for structured data, 2020. URL: https://arxiv.org/abs/2003.06505. arXiv:2003.06505.
[9] M. Ferrari Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 101–109. URL: https://doi.org/10.1145/3298689.3347058. doi:10.1145/3298689.3347058.
[10] A. Bar, L. Rokach, G. Shani, B. Shapira, A. Schclar, Improving Simple Collaborative Filtering Models Using Ensemble Methods, in: Z.-H. Zhou, F. Roli, J. Kittler (Eds.), Multiple Classifier Systems, volume 7872 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 1–12. URL: http://link.springer.com/10.1007/978-3-642-38067-9_1. doi:10.1007/978-3-642-38067-9_1.
[11] R. Boim, T. Milo, Methods for boosting recommender systems, in: 2011 IEEE 27th International Conference on Data Engineering Workshops, 2011, pp. 288–291. doi:10.1109/ICDEW.2011.5767667.
[12] A. da Costa Fortes, M. G. Manzato, Ensemble Learning in Recommender Systems: Combining Multiple User Interactions for Ranking Personalization, in: Proceedings of the 20th Brazilian Symposium on Multimedia and the Web, WebMedia '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 47–54. URL: https://doi.org/10.1145/2664551.2664556. doi:10.1145/2664551.2664556.
[13] A. F. da Costa, M. G. Manzato, Exploiting multimodal interactions in recommender systems with ensemble algorithms, Information Systems 56 (2016) 120–132. URL: https://www.sciencedirect.com/science/article/pii/S0306437915300818. doi:10.1016/j.is.2015.09.007.
[14] D. Carmel, E. Haramaty, A. Lazerson, L. Lewin-Eytan, Multi-Objective Ranking Optimization for Product Search Using Stochastic Label Aggregation, in: Proceedings of The Web Conference 2020, WWW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 373–383. URL: https://doi.org/10.1145/3366423.3380122. doi:10.1145/3366423.3380122.
[15] P. Nguyen, J. Dines, J. Krasnodebski, A Multi-Objective Learning to re-Rank Approach to Optimize Online Marketplaces for Multiple Stakeholders, 2017. URL: http://arxiv.org/abs/1708.00651. doi:10.48550/arXiv.1708.00651. arXiv:1708.00651 [cs].
[16] N. H. Kulkarni, G. N. Srinivasan, B. M. Sagar, N. K. Cauvery, Improving Crop Productivity Through A Crop Recommendation System Using Ensembling Technique, in: 2018 3rd International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS), 2018, pp. 114–119. doi:10.1109/CSITSS.2018.8768790.
[17] H. Wu, K. Yue, Y. Pei, B. Li, Y. Zhao, F. Dong, Collaborative topic regression with social trust ensemble for recommendation in social media systems, Knowledge-Based Systems 97 (2016). doi:10.1016/j.knosys.2016.01.011.
[18] S. Forouzandeh, K. Berahmand, M. Rostami, Presentation of a recommender system with ensemble learning and graph embedding: a case on MovieLens, Multimedia Tools and Applications 80 (2021) 7805–7832. URL: https://doi.org/10.1007/s11042-020-09949-5. doi:10.1007/s11042-020-09949-5.
[19] T. Vente, L. Purucker, J. Beel, The feasibility of greedy ensemble selection for automated recommender systems, in: COSEAL Workshop 2022, 2022. URL: https://www.researchgate.net/publication/373841225_The_Feasibility_of_Greedy_Ensemble_Selection_for_Automated_Recommender_Systems.
[20] T. Vente, M. Ekstrand, J. Beel, Introducing LensKit-Auto, an experimental automated recommender system (AutoRecSys) toolkit, in: Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 1212–1216.
[21] T.-T. Wong, P.-Y. Yeh, Reliable accuracy estimates from k-fold cross validation, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 1586–1594. doi:10.1109/TKDE.2019.2912815.
[22] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1441–1450. URL: https://doi.org/10.1145/3357384.3357895. doi:10.1145/3357384.3357895.
[23] Z. Yue, Z. He, H. Zeng, J. McAuley, Black-box attacks on sequential recommenders via data-free model extraction, in: Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 44–54. URL: https://doi.org/10.1145/3460231.3474275. doi:10.1145/3460231.3474275.
[24] Z. Yue, H. Zeng, Z. Kou, L. Shang, D. Wang, Defending substitution-based profile pollution attacks on sequential recommenders, in: Proceedings of the 16th ACM Conference on Recommender Systems, RecSys '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 59–70. URL: https://doi.org/10.1145/3523227.3546770. doi:10.1145/3523227.3546770.
[25] H. Wang, B. Chen, W.-J. Li, Collaborative topic regression with social regularization for tag recommendation, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, AAAI Press, 2013, pp. 2719–2725.
[26] I. Cantador, P. Brusilovsky, T. Kuflik, Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011), in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 387–388. URL: https://doi.org/10.1145/2043932.2044016. doi:10.1145/2043932.2044016.
[27] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[28] B. Frederickson, Fast Python collaborative filtering for implicit datasets, 2018. URL: https://github.com/benfred/implicit.
[29] M. D. Ekstrand, LensKit for Python: Next-generation software for recommender systems experiments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2999–3006. URL: https://doi.org/10.1145/3340531.3412778. doi:10.1145/3340531.3412778.