Greedy Ensemble Selection for Top-N Recommendations

Tobias Vente¹, Zainil Mehta¹, Lukas Wegmeth¹ and Joeran Beel¹
¹ Intelligent Systems Group, University of Siegen, Germany
Contact: tobias.vente@uni-siegen.de (T. Vente); zainil.mehta@student.uni-siegen.de (Z. Mehta); lukas.wegmeth@uni-siegen.de (L. Wegmeth); joeran.beel@uni-siegen.de (J. Beel)
ORCID: 0009-0003-8881-2379 (T. Vente); 0009-0002-0556-9493 (Z. Mehta); 0000-0001-8848-9434 (L. Wegmeth); 0000-0002-4537-5573 (J. Beel)

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October 2024, Bari, Italy.
© Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Despite the pivotal role ensembling played in the success of BellKor's Pragmatic Chaos Team in winning the Netflix Prize challenge in the early 2000s, it never became a standard practice in recommender systems. In contrast, ensembling, particularly greedy ensemble selection, has become a standard practice in machine learning to enhance performance compared to a single model. Despite the success of greedy ensemble selection in classification and regression tasks, it has not been adapted for top-n prediction tasks. Hence, in this study, we aim to analyze the potential of greedy ensemble selection to boost the performance of recommender system models for top-n prediction tasks. We adapt the concept of greedy ensemble selection for top-n prediction tasks, train and optimize ten factorization- and neighborhood-based models on five datasets, and compare the performance of the ensemble to that of the individual models. Our experiments reveal that greedy ensemble selection always performs better than a single model and enhances performance by an average of 8.8% on NDCG@5, 8.6% on NDCG@10, and 16.3% on NDCG@20 compared to the single best model.

Keywords
Ensembling, Recommender Systems, Algorithm Selection, Automatic Algorithm Selection

1. Introduction

Ensembling played a pivotal role for BellKor's Pragmatic Chaos Team, enhancing their recommender system to win the Netflix Prize challenge in the early 2000s [1, 2]. Despite its success in the competition, ensembling did not become a standard practice in the field of recommender systems. Today, mainly hybrid recommender systems rely on ensembling, either trained on different data or used to aggregate predictions [3, 4], often combining collaborative filtering with content-based models to cancel out the weaknesses of individual models, such as the cold-start problem [5].

In comparison, in machine learning, ensembling, particularly greedy ensemble selection, is a standard practice [6], enhancing performance by as much as 37% in best-case scenarios and improving robustness [7]. Moreover, in automated machine learning, ensembling enhances performance to such an extent that some tools prioritize ensembling over further hyperparameter optimization [8].

However, despite the success of greedy ensemble selection for regression and classification, it has never been adapted for top-n ranking prediction tasks [6]. While ensembling has proven effective in machine learning, it remains a largely overlooked approach in the field of recommender systems. In recommender systems, researchers continue to debate whether the field makes progress, yet the focus primarily remains on continuously developing more sophisticated algorithms [9]. Instead of implementing a new, more complex recommender system algorithm, we focus on ensembling already existing algorithms.

Therefore, we analyze the potential of greedy ensemble selection for top-n prediction tasks and answer the question: RQ: How does the ensembling of factorization- and neighborhood-based models impact performance and robustness compared to a single optimized model?

In this work, we adapt greedy ensemble selection for top-n prediction tasks to assess its potential for enhancing ranking performance. We focus on ten fast and easy-to-train factorization- and neighborhood-based models. We then evaluate the ensemble output of these ten models on five datasets using NDCG@k, with k set to 5, 10, and 20, aiming to quantify the performance and robustness improvements compared to single optimized models.

Our contribution is the implementation of greedy ensemble selection for top-n ranking prediction tasks, along with a comprehensive analysis of its performance impact and robustness compared to single optimized models. Our results indicate that greedy ensemble selection improves performance by an average of 8.8% on NDCG@5, 8.6% on NDCG@10, and 16.3% on NDCG@20 compared to the single best model on five datasets. Additionally, while no single model performs best across all datasets, greedy ensemble selection consistently performs best, making it the most robust recommender with regard to performance.

The implementation of greedy ensemble selection, along with the code and necessary documentation to reproduce all experiments, is publicly available in our GitHub repository¹.

¹ https://github.com/ISG-Siegen/greedy-ensemble-selection-for-top-n-recommendations
2. Related Work

The use of ensembling techniques in recommender systems is not new and has been covered in the literature [4, 3, 5, 10]. Today, primarily hybrid recommender systems use ensembling to mitigate the weaknesses of individual algorithms [3, 5, 4]. However, hybrid recommender systems require knowledge of the strengths and weaknesses of the different algorithms to ensemble them effectively. In contrast to hybrid recommender systems, our work focuses on ensembling recommender system algorithms without manually selecting complementary algorithm combinations.

As in our work, researchers have applied ensembling techniques that do not require manual model selection. For example, standard machine learning ensemble techniques, such as bagging and boosting, have been applied to recommender systems [11, 10], allowing the ensembling of a diverse set of models without manual model selection. However, this work mostly focuses on rating prediction tasks. Furthermore, bagging and boosting require the modification of training data. We focus on post-hoc ensembling, taking only model predictions into account.

Additionally, researchers have analyzed ensembling for various other aspects of recommender systems. Researchers ensemble predictions of models trained on different datasets containing different user feedback types [12, 13], ensemble models optimized for different objectives [14, 15], or implement ensembling techniques specifically designed and tested for certain domains, applications, or with a limited number of base models [16, 14, 17]. Others optimize ensembling techniques for specific datasets to showcase the capabilities of ensembling without focusing on generalization [18]. However, these approaches have limitations: they often only work with models from the same algorithm, require multiple data inputs, necessitate optimization for multiple objectives, or focus on ensembling for rating predictions.

Recent work has focused on greedy ensemble selection for recommender systems [19, 20]. In this work, the authors applied greedy ensemble selection to rating prediction tasks by treating them as regression problems, thus utilizing the standard greedy ensemble selection approach for regression. However, this method does not offer solutions for top-n ranking predictions, leaving a gap in the application of greedy ensemble selection in recommender systems.
3. Greedy Ensemble Selection

Greedy ensemble selection, as implemented in machine learning, cannot be directly applied to top-n recommendations in recommender systems. Greedy ensemble selection for classification applies majority voting on predictions. However, majority voting fails for top-n recommendations since the number of repeating item recommendations across users is often insufficient. Similarly, taking the mean, as in regression tasks, is not applicable, as top-n recommendations deal with ranked lists instead of the single numeric values returned by each model. Therefore, we focus on aggregating and re-ranking the prediction scores of multiple models to generate an ensemble of their outputs.

To apply greedy ensemble selection, we assume we have a set of trained and optimized models P, each predicting k' items, together with their validation performance. We aim to aggregate the k' predictions of every model p_n ∈ P into one ranked list of length k. This requires the length k' of every predicted list p_n to be at least as long as k.

In ensembling, we can utilize more predictions (k') than the desired output list length (k). There is a chance that a prediction ranked at position k + 1 or beyond still holds relevance or contributes valuable information, even though it does not make it into the top k predictions of the model. Taking k' predictions into account enables the ensemble process to utilize a broader range of data, potentially improving the performance of the final ensembled recommendations. Furthermore, utilizing k' predictions does not increase the prediction cost of the base models, since all models score all candidate items anyway before selecting the top k.

We normalize all k' prediction scores and multiply each prediction score by the validation performance of the respective model p_n (Algorithm 1). The normalization ensures that all models have an equal impact on the ensembling, while the multiplication by validation performance weights the impact based on the models' performance. Consequently, the impact of models with good validation performance is increased relative to that of poorly performing models.

Then, we initiate the greedy search by examining all subsets of models p_n ∈ P and, for each subset, aggregating the k' predictions by summing and re-ranking the prediction scores of all models included in the subset (Algorithm 1). If an item appears in the k' lists of multiple models, its scores are summed to reflect its collective relevance across models. We then select the top-k predictions of the ensembled list. The performance of every subset of models p_n ∈ P is evaluated on the validation set. The best-performing ensemble of models is then selected as the final result of the greedy ensemble selection, and its top-k predictions are returned.
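Algorithm 1 is not reproduced in this extracted text. The following is a minimal Python sketch of the aggregation and subset-search logic described above, written for a single user's recommendation lists. The function names, the per-model dictionaries, the min-max normalization, and the plain enumeration of all model subsets are our illustrative assumptions, not the authors' reference implementation, which is available in the linked repository.

```python
from itertools import combinations

def ensemble_scores(predictions, val_perf, subset, k):
    """Aggregate and re-rank the top-k' scores of the models in `subset`.

    predictions: {model_name: {item_id: score}}  # each model's top-k' items
    val_perf:    {model_name: float}             # validation NDCG used as weight
    """
    combined = {}
    for model in subset:
        scores = predictions[model]
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for item, score in scores.items():
            # Normalize per model, then weight by validation performance.
            weighted = (score - lo) / span * val_perf[model]
            # Items recommended by several models accumulate their scores.
            combined[item] = combined.get(item, 0.0) + weighted
    return sorted(combined, key=combined.get, reverse=True)[:k]

def greedy_ensemble_selection(predictions, val_perf, k, evaluate):
    """Return the model subset whose aggregated top-k list scores best on validation.

    evaluate: maps a ranked item list to a validation metric such as NDCG@k.
    """
    models = list(predictions)
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(models) + 1):
        for subset in combinations(models, size):
            score = evaluate(ensemble_scores(predictions, val_perf, subset, k))
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, ensemble_scores(predictions, val_perf, best_subset, k)
```

With ten base models, this enumeration covers 2^10 - 1 = 1,023 candidate subsets per configuration, which matches the exhaustive search the authors mention in the discussion; an incremental greedy loop that adds one model at a time would be the cheaper textbook variant.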
4. Experiments

We conducted all of our experiments with ten different factorization- and neighborhood-based algorithms and greedy ensemble selection on five datasets. The hardware includes AMD EPYC 7452 CPU processors, each with 32 cores and a CPU frequency ranging from 2.35 to 3.35 GHz.

4.1. Experimental Pipeline

In our experimental pipeline, we apply five-fold cross-validation to all five datasets, randomly splitting each fold into three sets: 60% for training, 20% for validation, and 20% for testing. With the training and validation sets, we optimize all included algorithms with two hours of random search to select the best hyperparameter configuration. With the test set, we evaluate the final performance of the single models as well as the greedy ensemble selection. This combination of five-fold cross-validation and random search allows each algorithm to be finely tuned on every subset of the data while mitigating the effects of randomness in data splits and hyperparameter selection [21].

We measure performance using NDCG@k for k = 5, 10, 20 to evaluate top-n ranking predictions for different list lengths. The NDCG@k model performance on the validation set, obtained from the random search optimization process, is later used to weight the model predictions (Section 3).
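The paper does not spell out the metric; for reference, the standard definition of NDCG@k with binary relevance, the usual form for implicit-feedback top-n evaluation, is:

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where rel_i ∈ {0, 1} indicates whether the item at rank i is a held-out interaction of the user, and IDCG@k is the DCG@k of the ideal ranking that places all held-out items first. Per-user scores are averaged over all users.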
4.1.1. Datasets

We include five distinct datasets of different sizes in our experiments and refer to Table 1 for a detailed overview. We convert datasets with user ratings, specifically MovieLens-1M, MovieLens-100k, and CiaoDVD, into binary user feedback datasets, as is done in related work [4, 3, 5]. Furthermore, we prune all datasets such that all included users and items have at least five interactions, commonly known as five-core pruning [22, 23, 24]. Table 1 shows all included datasets' statistics after preprocessing.

Table 1
Dataset statistics after five-core pruning and user feedback transformation, split between the implicit (first part) and explicit (second part) feedback datasets.

Name                 Interactions   Users   Items    Sparsity
Citeulike-a [25]     200,180        5,536   15,429   99.77%
Hetrec-Lastfm [26]   71,355         1,859   2,823    98.64%
CiaoDVD²             23,467         1,582   1,788    99.17%
MovieLens-1M [27]    835,789        6,038   3,307    95.81%
MovieLens-100k [27]  81,697         943     1,203    92.8%
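A minimal pandas sketch of the preprocessing described in Section 4.1.1; the column names and the choice to keep every observed rating as a positive interaction (rather than thresholding the rating value) are assumptions on our part, since the paper does not specify these details.

```python
import pandas as pd

def binarize_feedback(df: pd.DataFrame) -> pd.DataFrame:
    """Turn explicit ratings into implicit feedback: every observed
    (user, item) pair becomes a single positive interaction."""
    return df[["user", "item"]].drop_duplicates()

def five_core_prune(df: pd.DataFrame, min_interactions: int = 5) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than `min_interactions`
    until both constraints hold at the same time."""
    while True:
        user_ok = df["user"].map(df["user"].value_counts()) >= min_interactions
        item_ok = df["item"].map(df["item"].value_counts()) >= min_interactions
        keep = user_ok & item_ok
        if keep.all():
            return df
        df = df[keep]
```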
4.1.2. Algorithms

We include ten factorization- and neighborhood-based recommender systems algorithms in our experiments. The algorithm implementations are from the Implicit [28] and LensKit [29] recommender systems libraries. The algorithms from Implicit are Alternating Least Squares (ALS), Logistic Matrix Factorization (LogisticMF), Bayesian Personalized Ranking (BPR), and Item-Item Nearest Neighbors with the distance metrics Cosine Similarity, TF-IDF, and BM25. The algorithms from LensKit are Implicit Matrix Factorization (ImplicitMF), User-User Nearest Neighbors (UserKNN), Item-Item Nearest Neighbors (ItemKNN), and Most Popular.

4.2. Greedy Ensemble Selection

We run greedy ensemble selection using various prediction input list lengths (k') to examine the impact of predictions ranked beyond position k on the ensembling process (Section 3). We set k' to 5, 10, 15, 25, 50, 75, 100, 125, and 150. This wide range of k' values helps us identify trends in the impact of longer input list lengths. All ensemble configurations are evaluated on the validation set. Ultimately, we select the ensemble configuration, with its optimal k' value, that performs best on average across all folds.

[Figure 1 (line chart): "Average Ensembling Performance on Various k' Values"; series: Greedy Ensemble Selection vs. Single Best Algorithm; x-axis: k' (top-n recommendations), y-axis: relative NDCG@10 performance.]

Figure 1: Performance differences of greedy ensemble selection with varying input prediction list lengths (k') compared to the virtual single best algorithm, averaged over five datasets. The x-axis represents the input prediction list lengths (k'), and the y-axis shows the relative NDCG@10 performance. The shaded band represents the 95% confidence interval for NDCG@10.
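A sketch of the k' sweep described above, reusing the hypothetical greedy_ensemble_selection helper from the Section 3 sketch. It is simplified to a single fold, and truncating each model's full ranked list to the current k' before ensembling is an assumption about how the sweep is wired up; the paper selects the configuration by the average validation score across all five folds.

```python
K_PRIME_GRID = [5, 10, 15, 25, 50, 75, 100, 125, 150]

def sweep_k_prime(ranked_predictions, val_perf, k, evaluate):
    """ranked_predictions: {model_name: [(item_id, score), ...]} sorted by score."""
    best = None  # (validation score, k', model subset)
    for k_prime in K_PRIME_GRID:
        if k_prime < k:
            continue  # k' must be at least as long as the output list
        truncated = {m: dict(items[:k_prime])
                     for m, items in ranked_predictions.items()}
        subset, top_k = greedy_ensemble_selection(truncated, val_perf, k, evaluate)
        score = evaluate(top_k)
        if best is None or score > best[0]:
            best = (score, k_prime, subset)
    return best
```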
5. Results

Our experiments reveal that greedy ensemble selection enhances performance by an average of 8.8% on NDCG@5, 8.6% on NDCG@10 (Table 2), and 16.3% on NDCG@20 compared to the single best model. Since NDCG@10 is the most commonly used evaluation metric with a cutoff of k = 10 and the trends are consistent across all k values, our analysis focuses on the NDCG@10 results.

Table 2
NDCG@10 performance of ten factorization- and neighborhood-based models, along with greedy ensemble selection, across five datasets. The best result among the individual models per dataset is marked with an asterisk; the overall best result per dataset is achieved by Greedy Ensemble in every case. The relative performance increase is calculated based on the performance of Popularity.

Algorithms        CiaoDVD   CiteULike-A   Hetrec-LastFM   MovieLens-1M   MovieLens-100k   Rel. Performance Increase
ALS               0.02      0.066         0.147           0.234          0.232*           84%
BPR               0.013     0.027         0.082           0.119          0.173            9%
ImplicitMF        0.022     0.109         0.159           0.189          0.184            75%
ItemItem-BM25     0.024     0.106         0.168           0.234          0.221            99%
ItemItem-Cosine   0.01      0.082         0.169           0.217          0.21             82%
ItemItem-TFIDF    0.016     0.094         0.168           0.225          0.218            90%
ItemKNN           0.013     0.096         0.175*          0.216          0.212            88%
LogisticMF        0.018     0.058         0.135           0.161          0.191            49%
UserKNN           0.026*    0.112*        0.157           0.235*         0.225            99%
Popularity        0.016     0.009         0.07            0.142          0.142            0%
Greedy Ensemble   0.03      0.117         0.183           0.247          0.243            108%

In general, the algorithm performance ranking varies across datasets. While Popularity always yields the lowest NDCG@10 score (Table 2), the best-performing algorithm changes: UserKNN performs best on CiaoDVD, CiteULike-A, and MovieLens-1M, ItemKNN performs best on Hetrec-LastFM, and ALS performs best on MovieLens-100k.

In contrast to the single-algorithm performances, greedy ensemble selection consistently outperforms all algorithms across all datasets for all NDCG@k values and averts the algorithm selection problem. Greedy ensemble selection effectively identifies and aggregates a subset of models that outperforms the single best model, resulting in the highest NDCG@k scores on all five datasets. This approach reliably enhances performance and delivers robust results compared to the respective single best model.

On average, the overall performance advantage of greedy ensemble selection compared to the single best algorithm is 8.6% on NDCG@10 (108% vs. 99% for UserKNN, Table 2), but it varies across datasets. Greedy ensemble selection shows a performance increase as high as 15.4% on CiaoDVD (0.03 vs. 0.026 for UserKNN, Table 2) and as low as 0.7% on Hetrec-LastFM (0.175 vs. 0.174 for ItemKNN). On datasets like CiteULike-A, MovieLens-1M, and MovieLens-100k, the performance increase is approximately 5%.

Longer prediction input lists of length k' (Section 3) improve the overall ensemble performance (Fig. 1). Predictions that do not make it into the final k predictions of the single models still contribute valuable information to the ensemble process. While ensembling k' = k predictions already enhances performance, increasing k' can further improve results. We tested using up to 150 predictions per user from each model and observed that the ensemble's performance plateaued beyond k' = 100 predictions (Fig. 1). Additionally, increasing k' beyond this point incurs higher computational costs during the ensembling process without yielding significant performance gains.

6. Discussion

To comprehensively answer our research question (how does the ensembling of factorization- and neighborhood-based models impact performance compared to a single optimized model?), we conclude that greedy ensemble selection of factorization- and neighborhood-based models enhances performance, on average, by up to 16.3% compared to the single best model, averaged over all datasets.

Our experiments show that greedy ensemble selection enhances performance and avoids the need for manual algorithm selection. By ensembling a subset of all available algorithms, greedy ensemble selection consistently achieves better results than any single algorithm across all included datasets. However, ensembling introduces an additional step in the recommender systems pipeline.

Despite its performance boost, greedy ensemble selection for top-n recommendations is expensive compared to single factorization- and neighborhood-based models. In addition to adding complexity to the pipeline, ensembling requires the training and optimization of multiple models to utilize their predictions for the top-n recommendations, increasing the overall complexity and computational cost.

Nevertheless, the research community appears willing to accept higher computational costs for better performance, as evidenced by ever more sophisticated algorithms and the growing use of deep-learning approaches. While greedy ensemble selection involves an exhaustive search, further research could optimize the ensembling process for greater efficiency.

Currently, there is an ongoing debate in the field about whether recommender systems are truly making progress [9]. Much of the current research focuses on developing new (deep-learning) approaches, which do not necessarily outperform well-optimized traditional models. Revisiting and adapting ensembling for top-n recommendations, particularly with easy-to-train traditional recommender system algorithms, could open a new research direction. By adapting and improving advanced ensembling methods, recommender systems could significantly enhance their performance, especially for top-n predictions.

6.1. Future Work

Future work can investigate the contribution of more advanced deep-learning algorithms to the ensembling process. This includes assessing those models' potential performance enhancements and comparing the overall performance to state-of-the-art deep-learning methods. Furthermore, analyzing more efficient strategies to build an effective ensemble is valuable. Finally, examining the impact of ensembling across different domains within recommender systems helps to better understand domain-specific trends.
References

[1] A. Toscher, M. Jahrer, R. M. Bell, The BigChaos solution to the Netflix Grand Prize.
[2] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259. doi:10.1016/S0893-6080(05)80023-1.
[3] E. Γ‡ano, M. Morisio, Hybrid recommender systems: A systematic literature review, Intell. Data Anal. 21 (2017) 1487–1524. URL: https://doi.org/10.3233/IDA-163209. doi:10.3233/IDA-163209.
[4] R. Burke, Hybrid Systems for Personalized Recommendations, volume 3169 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 133–152. URL: http://link.springer.com/10.1007/11577935_7. doi:10.1007/11577935_7.
[5] R. Burke, Hybrid recommender systems: Survey and experiments, User Modeling and User-Adapted Interaction 12 (2002) 331–370. doi:10.1023/A:1021240730564.
[6] P. Gijsbers, M. L. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl, J. Vanschoren, AMLB: an AutoML benchmark, Journal of Machine Learning Research 25 (2024) 1–65.
[7] J. Heinermann, O. Kramer, Machine learning ensembles for wind power prediction, Renewable Energy 89 (2016) 671–679.
[8] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, A. Smola, AutoGluon-Tabular: Robust and accurate AutoML for structured data, 2020. URL: https://arxiv.org/abs/2003.06505. arXiv:2003.06505.
[9] M. Ferrari Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 101–109. URL: https://doi.org/10.1145/3298689.3347058. doi:10.1145/3298689.3347058.
[10] A. Bar, L. Rokach, G. Shani, B. Shapira, A. Schclar, Improving Simple Collaborative Filtering Models Using Ensemble Methods, in: Z.-H. Zhou, F. Roli, J. Kittler (Eds.), Multiple Classifier Systems, volume 7872 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 1–12. URL: http://link.springer.com/10.1007/978-3-642-38067-9_1. doi:10.1007/978-3-642-38067-9_1.
[11] R. Boim, T. Milo, Methods for boosting recommender systems, in: 2011 IEEE 27th International Conference on Data Engineering Workshops, 2011, pp. 288–291. doi:10.1109/ICDEW.2011.5767667.
[12] A. da Costa Fortes, M. G. Manzato, Ensemble Learning in Recommender Systems: Combining Multiple User Interactions for Ranking Personalization, in: Proceedings of the 20th Brazilian Symposium on Multimedia and the Web, WebMedia '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 47–54. URL: https://doi.org/10.1145/2664551.2664556. doi:10.1145/2664551.2664556.
[13] A. F. da Costa, M. G. Manzato, Exploiting multimodal interactions in recommender systems with ensemble algorithms, Information Systems 56 (2016) 120–132. URL: https://www.sciencedirect.com/science/article/pii/S0306437915300818. doi:10.1016/j.is.2015.09.007.
[14] D. Carmel, E. Haramaty, A. Lazerson, L. Lewin-Eytan, Multi-Objective Ranking Optimization for Product Search Using Stochastic Label Aggregation, in: Proceedings of The Web Conference 2020, WWW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 373–383. URL: https://doi.org/10.1145/3366423.3380122. doi:10.1145/3366423.3380122.
[15] P. Nguyen, J. Dines, J. Krasnodebski, A Multi-Objective Learning to re-Rank Approach to Optimize Online Marketplaces for Multiple Stakeholders, 2017. URL: http://arxiv.org/abs/1708.00651. doi:10.48550/arXiv.1708.00651. arXiv:1708.00651 [cs].
[16] N. H. Kulkarni, G. N. Srinivasan, B. M. Sagar, N. K. Cauvery, Improving Crop Productivity Through A Crop Recommendation System Using Ensembling Technique, in: 2018 3rd International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS), 2018, pp. 114–119. doi:10.1109/CSITSS.2018.8768790.
[17] H. Wu, K. Yue, Y. Pei, B. Li, Y. Zhao, F. Dong, Collaborative topic regression with social trust ensemble for recommendation in social media systems, Knowledge-Based Systems 97 (2016). doi:10.1016/j.knosys.2016.01.011.
[18] S. Forouzandeh, K. Berahmand, M. Rostami, Presentation of a recommender system with ensemble learning and graph embedding: a case on MovieLens, Multimedia Tools and Applications 80 (2021) 7805–7832. URL: https://doi.org/10.1007/s11042-020-09949-5. doi:10.1007/s11042-020-09949-5.
[19] T. Vente, L. Purucker, J. Beel, The feasibility of greedy ensemble selection for automated recommender systems, in: COSEAL Workshop 2022, 2022. URL: https://www.researchgate.net/publication/373841225_The_Feasibility_of_Greedy_Ensemble_Selection_for_Automated_Recommender_Systems.
[20] T. Vente, M. Ekstrand, J. Beel, Introducing LensKit-Auto, an experimental automated recommender system (AutoRecSys) toolkit, in: Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 1212–1216.
[21] T.-T. Wong, P.-Y. Yeh, Reliable accuracy estimates from k-fold cross validation, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 1586–1594. doi:10.1109/TKDE.2019.2912815.
[22] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1441–1450. URL: https://doi.org/10.1145/3357384.3357895. doi:10.1145/3357384.3357895.
[23] Z. Yue, Z. He, H. Zeng, J. McAuley, Black-box attacks on sequential recommenders via data-free model extraction, in: Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 44–54. URL: https://doi.org/10.1145/3460231.3474275. doi:10.1145/3460231.3474275.
[24] Z. Yue, H. Zeng, Z. Kou, L. Shang, D. Wang, Defending substitution-based profile pollution attacks on sequential recommenders, in: Proceedings of the 16th ACM Conference on Recommender Systems, RecSys '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 59–70. URL: https://doi.org/10.1145/3523227.3546770. doi:10.1145/3523227.3546770.
[25] H. Wang, B. Chen, W.-J. Li, Collaborative topic regression with social regularization for tag recommendation, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, AAAI Press, 2013, pp. 2719–2725.
[26] I. Cantador, P. Brusilovsky, T. Kuflik, Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011), in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 387–388. URL: https://doi.org/10.1145/2043932.2044016. doi:10.1145/2043932.2044016.
[27] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[28] B. Frederickson, Fast Python collaborative filtering for implicit datasets, 2018. URL: https://github.com/benfred/implicit.
[29] M. D. Ekstrand, LensKit for Python: Next-generation software for recommender systems experiments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2999–3006. URL: https://doi.org/10.1145/3340531.3412778. doi:10.1145/3340531.3412778.