A First Analysis of Meta-Learned Per-Instance Algorithm Selection in Scholarly Recommender Systems Andrew Collins Joeran Beel ADAPT Centre, School of ADAPT Centre, School of Computer Science and Statistics, Computer Science and Statistics, Trinity College Dublin, Ireland Trinity College Dublin, Ireland ancollin@tcd.ie beelj@tcd.ie ABSTRACT The effectiveness of recommender system algorithms varies in 1 Introduction different real-world scenarios. It is difficult to choose a best Recommendation algorithm performance varies in different algorithm for a scenario due to the quantity of algorithms scenarios. A reliable intuition about what algorithms are best available, and because of their varying performances. suited to a given scenario can be elusive even to recommender Furthermore, it is not possible to choose one single algorithm system experts [17], and it is generally accepted that manual that will work optimally for all recommendation requests. We experimentation is required [4]. Correctly choosing an optimal, apply meta-learning to this problem of algorithm selection for single, algorithm however, will reduce the effectiveness of the scholarly article recommendation. We train a random forest, system overall [2][10]. It is not possible to choose one gradient boosting machine, and generalized linear model, to algorithm that works optimally for all recommendation predict a best-algorithm from a pool of content similarity- requests. based algorithms. We evaluate our approach on an offline Within real-world recommendation scenarios, algorithm dataset for scholarly article recommendation and attempt to performance is unpredictable. For example, a near-to-online predict the best algorithm per-instance. The best meta-learning evaluation of recommendation algorithms across 6 online news model achieved an average increase in F1 of 88% when websites showed that the performance of approaches was compared to the average F1 of all base-algorithms (F1; 0.0708 inconsistent [3]; a “most popular” algorithm performed best on vs 0.0376) and was significantly able to correctly select each one website (cio.de; precision: 0.56) and was the worst on base-algorithm (Paired t-test; p < 0.1). The meta-learner had a another (ksta.de; precision: 0.01). In the domain of scholarly 3% higher F1 when compared to the single-best base- article recommendation, we performed an online evaluation of algorithm (F1; 0.0739 vs 0.0717). We further perform an online 33.5M recommendations delivered across multiple evaluation of our approach, conducting an A/B test through our applications and found similarly inconsistent algorithm recommender-as-a-service platform Mr. DLib. We deliver 148K performances [8]; the best performing algorithm in one recommendations to users between January and March 2019. application (Document embeddings; Click-through rate (CTR): User engagement was significantly increased for 0.21%) was the worst performing in another (CTR: 0.02%). recommendations generated using our meta-learning Algorithm performance is also unpredictable when approach when compared to a random selection of algorithm considered at a per-instance level. In another previous (Click-through rate (CTR); 0.51% vs. 0.44%, Chi-Squared test; evaluation we found that, for example, the overall-worst p < 0.1), however our approach did not produce a higher CTR performing collaborative filtering algorithm for prediction in than the best algorithm alone (CTR; MoreLikeThis (Title): MovieLens datasets was frequently more accurate than all 0.58%). other algorithms for each user-item rating prediction [9]. If you can accurately predict when such an algorithm is optimal for CCS CONCEPTS each user-item prediction, significant gains in recommender system performance can be achieved (e.g., picking an optimal • Information Systems → Recommender Systems algorithm per-instance improves RMSE by 25.5% over the overall-best algorithm in this case [9]). KEYWORDS We face these challenges of variable and inconsistent Meta-learning, Algorithm Selection, TF-IDF, Online Evaluation algorithm performance as operators of the scholarly recommender-as-a-service Mr. DLib. We work with diverse ComplexRec 2019, 20 September 2019, Copenhagen, Denmark 2019. partners who each have different users, different websites or Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). applications, different corpora, etc. It is unknown how algorithms will perform for new partners, or for each ComplexRec 2019, Copenhagen, Denmark A. Collins and J. Beel recommendation request that is made, and it is not sufficient to Table 1. An illustration of the dataset used to train our meta- choose algorithms that have performed well for previous learners. The target is a categorical variable that indicates the partners [8]. best algorithm and search fields for this instance according to an Meta-learning for algorithm selection aims to predict the F1-score best algorithm to use in a given situation. It does this by learning the relationship between characteristics of data, and Title Title Researcher Doc. Collection length length Best Algorithm the performance of algorithms for that data [19][25]. For ID ID ID (chars) (words) (F1) example, ‘sparsity’ is a characteristic of ratings data (a meta- y1 1062 P00 104 14 MLT:title,abstract feature), and it might be learned that collaborative filtering y1 1064 P00 45 5 MLT:title,abstract algorithms are more effective on non-sparse datasets. Meta- y1 1029 P03 88 11 StandardQuery:title,abstract learning is useful when distinct algorithms perform differently … … … … … … in various situations, when those situations that can be y10 1053 P04 61 8 MLT:title,abstract characterized numerically, and when this performance Collins et al. [10] predict the best algorithm per-instance variation can be learned. from a pool of collaborative filtering algorithms using two In this paper we apply meta-learning for algorithm MovieLens datasets (100K and 1M). Their meta-learners were selection to the task of scholarly article recommendation. We unable to discriminate algorithm performance adequately aim to select the best algorithm for each data instance, and for (RMSE, 100K: 0.973; 1M: 0.908) and performed 2-3% worse each recommendation request received. We evaluate several than overall-best single algorithm SVD++ (RMSE, 100K: 0.942; approaches using a gold-standard offline dataset and deploy 1M: 0.887). the most promising approach to a live recommender system for Edenhofer et al. [14] use per-instance meta-learning on an online evaluation. offline dataset and find that a gradient boosting model improves recommender system effectiveness slightly when compared to the overall-best algorithm. 2 Related Work There are no online evaluations of instance-based meta- Meta-learning for algorithm selection in recommender systems learners for recommender systems to the best of our is typically used to predict a single best-algorithm for entire knowledge1. The effectiveness of meta-learning for algorithm datasets [10]. Several offline evaluations of algorithm selection selection in live recommender systems is not known, as far as using meta-learning for recommender systems exist in which we are aware. best-algorithms are predicted per-dataset, for example, Cunha et al. [11, 13], Romero et al. [22], Matuszyk and Spiliopoulou [21]. Furthermore, tools exist to automatically evaluate 3 Methodology algorithm pools for recommender systems at a per-dataset We hypothesize that there might be a relationship between level, such as librec-auto [20]. See Cunha et al. [12] for a survey text-based attributes of recommendation requests to a of recommender system-related algorithm selection literature scholarly article recommender system, and the performances and Smith-Miles [23] for a wider survey of meta-learning for of content-similarity algorithms for those requests. For algorithm selection. example, the length of a text query may alter a TF-IDF-based Per-instance meta-learning is well explored in other fields; algorithm’s effectiveness in a recommendation scenario, as see Kotthoff [18] for a comprehensive overview of per-instance shown in other domains [7]. If such a relationship can be meta-learning and its successful application to combinatorial learned by a meta-learner, then we expect that algorithm search problems for example. selection can improve engagement with scholarly article There are not many examples of meta-learning being used recommendations through the use of a more appropriate for recommender systems at lower levels of granularity than algorithm for a given request. per-dataset, that is, predicting the best algorithm for subsets of To evaluate meta-learning models’ abilities to learn such a data, or per-instance. relationship and increase user engagement we perform a two- Ekstrand and Riedl use a meta-learner to choose the best stage evaluation. We first identify candidate meta-learning algorithm per-user, from a small pool of algorithms, using a models via an offline evaluation. We then deploy any promising MovieLens dataset [15]. They predict an algorithm using two model to a live recommender system for an online evaluation. meta-features: the number of ratings by a user, and the variance of a user’s ratings. Their meta-learner performs 3.1. Offline Evaluation slightly worse than the overall-best single algorithm (RMSE; We approach our offline evaluation as a classification task. 0.78 vs 0.77). We use the Scholarly Paper Recommendation Dataset [24]. 1 A comprehensive summary of algorithm selection literature is available here: https://larskotthoff.github.io/assurvey/ Algorithm Selection in Scholarly Recommender Systems ComplexRec 2019, Copenhagen, Denmark This dataset contains 597 papers from a corpus of scholarly publications about Computational Linguistics (ACL Anthology Reference Corpus 2 ). The interests of 28 researchers are described in the dataset, and these researchers have manually indicated what papers within the corpus are relevant for them. A median of 30 relevant papers per-researcher are indicated. We import the titles and abstracts for all 597 documents into Solr3. We use two content similarity-based search algorithms built into Solr for our evaluation. For each document, in each researcher’s repository, we perform four queries as follows: 1) Two searches using Solr’s standard query parser. We use the title from each row of the dataset and search all documents in the corpus on either their title fields, or, search all documents using both their title and abstract fields. Fig. 1. Recommendations generated by Mr. DLib and displayed in 2) Two searches using Solr’s MoreLikeThis (MLT) class. Jabref. Recommendations are generated for the currently MoreLikeThis constructs a term vector from either selected repository item (indicated by a black arrow) and are the title field, or both the title and abstract fields, and displayed in a vertical list of up to 7 items (indicated with a returns similar documents. dashed red box). We perform a 4-fold cross validation for each of these Both approaches use TF-IDF and a scoring formula similar to models. For each instance in the validation set from each fold, cosine similarity 4 to rank results. The substantive difference we predict the best algorithm and perform a Solr search using between these two approaches is that the standard query this predicted-best algorithm. We evaluate their results parser will only use the title from each instance to find results, according to precision, recall, and F1. and MLT may use the abstract also. We further evaluate each base-algorithm across each We compare the documents returned from each of the four validation set. As a simple baseline we also use a randomly queries to the remaining documents marked as relevant by that selected algorithm for each row. To illustrate the upper-bound researcher. We rank the algorithms according to the F1 score of effectiveness that might be expected, we also use a “best- of the retrieved results and note the best performing algorithm. algorithm oracle” that uses the actual-best algorithm per- We also derive the length of the querying document’s title in instance. characters and words, and note the collection that the querying Our offline evaluation mirrors our online system, but with document is in. If all algorithms return zero results for a row, gold-standard data. 580 of the 597 documents in the small the row is removed. This results in a dataset with 750 instances Scholarly Paper Recommendation Dataset are contained in the (Table 1). corpus used in our online evaluation. We aim to learn a relationship between the performance of the base-algorithms and characteristics of the instance data. To 3.2. Online Evaluation do this, we train and evaluate three models: a random forest, generalized linear model, and gradient boosting machine5. As Mr. DLib is a recommendation-as-a-service provider that features we use the Collection ID 6 , Title Length (characters), delivers scholarly related-article recommendations for its and Title Length (words). The training target (meta-target) for partners. It is a white label of Darwin & Goliath [5]. Mr. DLib each model is the label of the actual best algorithm per-instance partners with Jabref, an open-source reference management according to F1 (i.e., the Best Algorithm column in Table 1). software, and delivers recommendations to their users [16]. The simple features that we use for this offline evaluation are Mr. DLib in-part uses the same content-similarity based representative of the limited features that we are able to use algorithms that we use in our offline evaluation to generate with some partners in our live recommender system. recommendations, specifically TF-IDF similarity using: Solr’s standard query parser and a requesting document’s title, and 2 http://acl-arc.comp.nus.edu.sg/ 6 The collection ID contains a letter indicating the venue the item was 3 We use the Cleaned ACL ARC dataset for titles and abstracts, available here: published in along with the type of publication https://web.eecs.umich.edu/~lahiri/acl_arc.html [5] (proceedings/journal/workshop), and 2 numbers indicating year of publication. 4 https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/ For a description of ACL’s naming convention see: search/similarities/TFIDFSimilarity.html https://aclweb.org/anthology/info/contrib/ 5 We use H2O’s implementations of these models ComplexRec 2019, Copenhagen, Denmark A. Collins and J. Beel Table 2. Precision, recall, and F1 for each base-algorithm when clicked, this gives a CTR of = 0.9%. We assume that if an selected arbitrarily, and for each base-algorithm when chosen by a meta-learner model (Random Forest, Generalized Linear algorithm is effective then, on average, users will interact with Model, Gradient Boosting Machine). The highest precision, recall, recommendations from this algorithm more frequently than and F1 for each base-algorithm is bolded. The overall per- recommendations from a less effective one. This will manifest instance meta-learner performance for each model is listed, with as a higher CTR for the more effective algorithm. the best precision, recall and F1 also highlighted We train and deploy the most effective meta-learning model from our offline evaluation. We use 2 months of user- Algorithm Precision Recall F1 interaction data to train this model, logged from November MoreLikeThis (Title) 2018 to January 2019. Our training set comprises Arbitrarily Selected 0.012 0.001 0.002 recommendations that were clicked by users and describe each Random Forest Meta-learner 0.383 0.039 0.070 querying document’s title length (words), title length (characters), and the hour of the day that a request was Generalized Linear Model Meta-learner 0.000 0.000 0.000 received. Recommendations are infrequently clicked and so Gradient Boosting Machine Meta-learner 0.300 0.028 0.050 our training set only includes ~1% of total recommendations MoreLikeThis (Title, Abstract) delivered in this period. The target of our model is the label of Arbitrarily Selected 0.313 0.042 0.072 the algorithm used to generate these previously clicked Random Forest Meta-learner 0.327 0.044 0.075 requests. Algorithms were selected with equal probability Generalized Linear Model Meta-learner 0.080 0.011 0.018 during the period in which training click-data was collected. Gradient Boosting Machine Meta-learner 0.332 0.046 0.077 During the evaluation period, half of recommendation Standard Query Parser (Title) requests received by Mr. DLib were fulfilled using the Arbitrarily Selected 0.117 0.022 0.035 predicted-best algorithm from our meta-learner. Unlike the Random Forest Meta-learner 0.191 0.047 0.072 offline evaluation, only one algorithm can be used for any specific recommendation request/instance in this live Generalized Linear Model Meta-learner 0.032 0.005 0.008 recommender system, therefore the remaining half of Gradient Boosting Machine Meta-learner 0.142 0.034 0.052 recommendation requests are fulfilled by a random selection of Standard Query Parser (Title, Abstract) algorithm (MoreLikeThis search, standard query search) and Arbitrarily Selected 0.123 0.027 0.041 search field (title, title and abstract). Mr. DLib only uses Random Forest Meta-learner 0.215 0.040 0.066 MoreLikeThis if a querying document is indexed in Solr 8. In the Generalized Linear Model Meta-learner 0.056 0.012 0.019 case that a querying document is not in Mr. DLib’s corpus, a Gradient Boosting Machine Meta-learner 0.218 0.051 0.077 fallback algorithm is used. Our evaluation is based on recommendations delivered to Best Algorithm Oracle 0.370 0.060 0.098 users of Jabref between January 2019 and March 2019. Random Algorithm 0.137 0.024 0.038 Per-Instance Meta-learners Overall Random Forest Meta-learner 0.303 0.044 0.074 4 Results Generalized Linear Model Meta-learner 0.168 0.028 0.046 Gradient Boosting Machine Meta-learner 0.286 0.044 0.073 4.1. Offline Evaluation Results Results from our offline evaluation are shown in Table 2. For using Solr’s MoreLikeThis on a requesting document that is each base-algorithm, precision, recall and F1 are always higher known and in Mr. DLib’s corpus. for instances where the algorithm is selected by a meta-learner, Mr. DLib makes recommendations to users of Jabref from versus when the base-algorithm is used alone (i.e., arbitrarily 120M documents in the CORE 7 collection of open access selected). The random forest achieves a significantly higher F1 research papers. Jabref makes a request for recommendations (Paired t-test; p < 0.1) on a per-algorithm basis over arbitrary using the title of the currently selected document in the user’s use of the base-algorithms (88% increase, average F1; 0.0708 repository (Fig. 1). Related-articles are then recommended vs 0.0376), followed by gradient boosting (71% increase, based on text and metadata from documents in the corpus, average F1; 0.0644 vs 0.0376). A generalized linear model including the title and abstract. Recommendations are produces a lower average F1 than the base-algorithms (70% displayed to users in a vertical list of up to 7 items (Fig. 1). decrease, average F1; 0.0114 vs 0.0376). The performance for We measure the effectiveness of each base-algorithm using each meta-learner overall, i.e., when the meta-learner selects click-through rate (CTR). This is the ratio of clicked what algorithm to use per-instance, is also listed in Table 2. Of recommendations to delivered recommendations. For the three meta-learning models, the random forest achieves the example, if 1000 recommendations are delivered and 9 are 7 https://core.ac.uk/ 8 MoreLikeThis can also use an external resource to conduct a search. To minimize recommendation response time Mr. DLib does not use this mechanism Algorithm Selection in Scholarly Recommender Systems ComplexRec 2019, Copenhagen, Denmark 1.40% 5 Discussion and Conclusion Clickthrough Rate (CTR) Our online evaluation shows that a random forest meta- 1.20% learner, using a requesting document’s title length in words 1.00% and characters and the hour of the day that a request was 0.80% 0.60% received as meta-features, is less effective than MoreLikeThis 0.40% (Title) alone (CTR; 0.58% vs 0.51%) and is only equally 0.20% effective as the Standard Query Parser using just the Title (Fig. 0.00% Algorithm CTRs when Selected Randomly Algorithm CTRs when Selected by Metalearner 2). Based on this evaluation, MoreLikeThis (Title) should be Standard Query Parser (Title) 0.51% 0.48% used instead of this meta-learning approach, or any of the other MoreLikeThis (Title) Standard Query Parser 0.58% 0.82% algorithms examined. However, although the meta-learner’s 0.35% 0.48% (Title, Abstract) MoreLikeThis (Title, effectiveness is not better than the overall best algorithm, the 0.46% 1.17% Abstract) Random Forest Metalearner results are encouraging in that the meta-learner is, to some 0.51% 0.51% (overall) Algorithms Selected extent, capable of learning when an algorithm performs best. 0.44% 0.44% Randomly (overall) This is indicated by the meta-learner’s higher effectiveness Fig. 2. Clickthrough rates for each base-algorithm, when used when compared to a random selection of algorithm. It is randomly, and when selected by the random forest meta-learner furthermore encouraging that these results are based on a in Mr. DLib. A random forest meta-learner produces an overall small number of simple meta-features, and that the training set 15.04% increase in CTR over a random selection of algorithm but considers only one simple form of implicit feedback, i.e., clicks, is less effective than the best individual algorithm, evaluated on in order to predict a best algorithm. 148,088 recommendations. In contrast to this online evaluation, our offline evaluation highest precision, recall, and F1. The overall-F1 of the random showed that algorithm selection was effective with the dataset forest is 3% higher than the overall-best base-algorithm (F1; we used, that is, with a subset of Mr. DLib’s main corpus along 0.0739 vs 0.0717 for MoreLikeThis (Title, Abstract)) indicating with gold standard indications of recommendation relevance. the meta-learner is effective in selecting algorithms Each individual algorithm was always more effective when appropriately. chosen by a meta-learner than when used arbitrarily. The upper bound on performance for these base- Furthermore, the overall F1 for the random forest and gradient algorithms is indicated by the ‘best algorithm oracle’, which boosting machine meta-learners was higher than any achieves a 32% higher F1 than the best meta-learner individual algorithm. The discrepancy between our offline and online results highlights the need to examine approaches in a live setting. 4.2. Online Evaluation Results We have evaluated these algorithms previously [8] and Mr. DLib delivered 148,088 recommendations to users found a significant variation in effectiveness across during the evaluation period, using the base-algorithms recommendation scenarios, that is, in different applications, for selected randomly, and using base-algorithms as selected by different users. The rank of algorithms according to their our random forest meta-learner. Overall there were 719 clicks effectiveness also differed, with the best algorithm in one upon recommendations, giving a total average click-through scenario being the worst in another, and vice versa. These rate of 0.49%. This average click-through rate seems low but is aberrations occur even with common corpora. The single-best consistent with previous large-scale evaluations that we have algorithm could therefore not be assumed without such an performed comprising 100M recommendations [6]. online evaluation. We feel that the meta-learning approach Results from our online evaluation are shown in Fig. 2. outlined is a simple alternative to a lengthy evaluation, or an Recommendations from three of the four base-algorithms arbitrary/random choice of algorithm. achieve a higher CTR when this base-algorithm is selected by To the best of our knowledge, this is the first online the random forest meta-learner. MoreLikeThis (Title and evaluation of meta-learning for algorithm selection in a Abstract) sees a ~150% increase in CTR when chosen by the scholarly recommender system. We hope that these results can meta-learner. Overall, the random forest meta-learner results be improved upon. Further work includes the use of a more in a 15.04% increase in CTR over a random selection of substantial dataset for offline evaluation9. More discriminative algorithms (CTR; 0.51% vs. 0.44%, Chi-Squared test; p < 0.1). meta-features should be evaluated, such as text features that However, the meta-learner’s overall CTR is not higher than the correspond to a predictable performance for retrieval methods, best two algorithms individually (CTR; Standard Query Parser e.g., average query IDF [1]. Furthermore, it will be necessary to (Title): 0.51%, MoreLikeThis (Title): 0.58%). examine the effectiveness of algorithm selection not only per- instance, but also across multiple scenarios and partners. 9 E.g., the large Scholarly Paper Recommendation Dataset. It contains 100,531 papers and lists the interests of 50 researchers [22]. Currently this dataset is not complete and does not include reference information for papers. ComplexRec 2019, Copenhagen, Denmark A. Collins and J. Beel [21] Matuszyk, P. and Spiliopoulou, M. 2014. Predicting the performance of ACKNOWLEDGMENTS collaborative filtering algorithms. Proceedings of the 4th International This publication has emanated from research conducted with Conference on Web Intelligence, Mining and Semantics (WIMS14) (2014), 38. the financial support of Science Foundation Ireland (SFI) under [22] Romero, C., Olmo, J.L. and Ventura, S. 2013. A meta-learning approach Grant Number 13/RC/2106. for recommending a subset of white-box classification algorithms for Moodle datasets. Educational Data Mining 2013 (2013). [23] Smith-Miles, K.A. 2009. Cross-disciplinary perspectives on meta- REFERENCES learning for algorithm selection. ACM Computing Surveys (CSUR). 41, 1 [1] Arora, S. and Yates, A. 2019. Investigating Retrieval Method Selection (2009), 6. with Axiomatic Features. 1st Interdisciplinary Workshop on Algorithm [24] Sugiyama, K. and Kan, M.-Y. 2013. Exploiting potential citation papers in Selection and Meta-Learning in Information Retrieval (AMIR). (2019). scholarly paper recommendation. Proceedings of the 13th ACM/IEEE-CS [2] Beel, J. 2017. A Macro/Micro Recommender System for joint conference on Digital libraries (2013), 153–162. Recommendation Algorithms [Proposal]. ResearchGate. (2017). [25] Vilalta, R., Giraud-Carrier, C.G., Brazdil, P. and Soares, C. 2004. Using [3] Beel, J., Breitinger, C., Langer, S., Lommatzsch, A. and Gipp, B. 2016. Meta-Learning to Support Data Mining. IJCSA. 1, 1 (2004), 31–45. Towards reproducibility in recommender-systems research. User modeling and user-adapted interaction. 26, 1 (2016), 69–101. [4] Beel, J., Gipp, B., Langer, S. and Breitinger, C. 2016. Research-paper Recommender Systems: a Literature Survey. International Journal on Digital Libraries. 17, 4 (2016), 305–338. [5] Beel, J., Griffin, A. and O’Shea, C. 2019. Darwin & Goliath: Recommendations-As-a-Service with Automated Algorithm-Selection and White-Labels. Proceedings of the 13th ACM Conference on Recommender Systems (2019). [6] Beel, J., Smyth, B. and Collins, A. 2019. RARD II: The 94 Million Related- Article Recommendation Dataset. Proceedings of the 1st Interdisciplinary Workshop on Algorithm Selection and Meta-Learning in Information Retrieval (AMIR) (2019). [7] Chung, T.L., Luk, R.W.P., Wong, K.F., Kwok, K.L. and Lee, D.L. 2006. Adapting pivoted document-length normalization for query size: Experiments in Chinese and English. ACM Transactions on Asian Language Information Processing (TALIP). 5, 3 (2006), 245–263. [8] Collins, A. and Beel, J. 2019. Document Embeddings vs. Keyphrases vs. Terms for Recommender Systems: A Large-Scale Online Evaluation. Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries. (2019). [9] Collins, A., Beel, J. and Tkaczyk, D. 2018. One-at-a-time: A Meta-Learning Recommender-System for Recommendation-Algorithm Selection on Micro Level. arXiv preprint arXiv:1805.12118. (2018). [10] Collins, A., Tkaczyk, D. and Beel, J. 2018. A Novel Approach to Recommendation Algorithm Selection using Meta-Learning. 26th AIAI Irish Conference on Artificial Intelligence and Cognitive Science. (2018), 210–219. [11] Cunha, T., Soares, C. and Carvalho, A.C. de 2018. CF4CF: recommending collaborative filtering algorithms using collaborative filtering. Proceedings of the 12th ACM Conference on Recommender Systems (2018), 357–361. [12] Cunha, T., Soares, C. and Carvalho, A.C. de 2018. Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences. 423, (2018), 128–144. [13] Cunha, T., Soares, C. and Carvalho, A.C. de 2016. Selecting Collaborative Filtering algorithms using Metalearning. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2016), 393– 409. [14] Edenhofer, G., Collins, A., Aizawa, A. and Beel, J. 2019. Augmenting the DonorsChoose.org Corpus for Meta-Learning. 1st Interdisciplinary Workshop on Algorithm Selection and Meta-Learning in Information Retrieval (AMIR). (2019). [15] Ekstrand, M. and Riedl, J. 2012. When recommenders fail: predicting recommender failure for algorithm selection and combination. Proceedings of the sixth ACM conference on Recommender systems (2012), 233–236. [16] Feyer, S., Siebert, S., Gipp, B., Aizawa, A. and Beel, J. 2017. Integration of the scientific recommender system Mr. DLib into the reference manager JabRef. European Conference on Information Retrieval (2017), 770–774. [17] Gomez-Uribe, C.A. and Hunt, N. 2015. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Trans. Manage. Inf. Syst. 6, 4 (2015), 13:1–13:19. [18] Kotthoff, L. 2016. Algorithm selection for combinatorial search problems: A survey. Data Mining and Constraint Programming. Springer. 149–190. [19] Lemke, C., Budka, M. and Gabrys, B. 2015. Metalearning: a survey of trends and technologies. Artificial intelligence review. 44, 1 (2015), 117– 130. [20] Mansoury, M. and Burke, R. 2019. Algorithm Selection with Librec-auto. AMIR@ ECIR (2019), 11–17.