EvalRS: a Rounded Evaluation of Recommender Systems

Jacopo Tagliabue1,2,∗,†, Federico Bianchi3,†, Tobias Schnabel4,†, Giuseppe Attanasio5,†, Ciro Greco1,2,†, Gabriel de Souza P. Moreira6,† and Patrick John Chia7,†

1 South Park Commons, New York, NY, USA
2 Coveo Labs, New York, NY, USA
3 Stanford University, Stanford, CA, USA
4 Microsoft, Redmond, WA, USA
5 Bocconi University, Milan, Italy
6 NVIDIA, São Paulo, Brazil
7 Coveo, Montreal, Canada

Abstract
Much of the complexity of recommender systems (RSs) comes from the fact that they are used as part of highly diverse real-world applications, which require them to deal with a wide array of user needs. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings, while giving little attention to the evaluation of RS behavior in real-world scenarios. Such a narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as the reinforcement of data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs "in the wild".

Keywords
recommender systems, behavioral testing, open source

1. Introduction

Recommender systems (RSs) are embedded in most applications we use today. From streaming services to online retailers, the accuracy of a RS is a key factor in the success of many products. Evaluation of RSs has often been done considering point-wise metrics, such as HitRate (HR) or nDCG over held-out data points, but the field has recently begun to recognize the importance of a more rounded evaluation as a better proxy to real-world performance [1]. We designed EvalRS as a new type of data challenge in which participants are asked to test their models incorporating quantitative as well as behavioral insights. Using a popular open dataset – Last.fm – we go beyond single aggregate numbers and instead require participants to optimize for a wide range of recommender system properties. The contribution of this challenge is two-fold:

1. We propose and standardize the data, evaluation loop and testing for RSs over a popular use case (user-item recommendations for music consumption [2]), thus releasing in the open domain a first unified benchmark for this topic.
2. We bring together the community on evaluation from both an industrial and research point of view, to foster an inclusive debate for a more nuanced evaluation of RSs.

In this paper, we describe the conceptual and practical motivations behind EvalRS, provide context on the organizers, related events and relevant literature, and explain the evaluation methodology we champion. For participation rules, up-to-date implementation details and all the artifacts produced before and during the challenge, please refer to the EvalRS official repository.1

2. Motivation

EvalRS at CIKM 2022 complements the existing challenge landscape and is driven by two different perspectives: the first one coming from academic research, the second one from the industrial development of RSs. We examine these in turn.

EvalRS 2022: CIKM EvalRS 2022 Data Challenge, October 21, 2022, Atlanta, GA
∗ Corresponding author.
† TS proposed the format and methodology and worked with JT and FB towards a first draft. PC led the implementation and contributed most of the RecList code. GA, CG, FB and PC researched, iterated and operationalized behavioral tests. GM reviewed the API and we implemented baselines, while GA, JT and FB prepared tutorials for participants. Everybody helped with drafting the paper, rules and guidelines. JT and FB acted as senior PIs in the project. JT and CG started this work at Coveo Labs, New York, NY, USA.
Email: tagliabue.jacopo@gmail.com (J. Tagliabue)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
1 https://github.com/RecList/evalRS-CIKM-2022
2.1. A Research Perspective

Although undeniable progress was made in the past years, concerns have been raised about the status of research advancements in the field of recommendations, particularly with respect to ephemeral processes in motivating architectural choices and lack of reproducibility [3]. This challenge draws attention to a further – and potentially deeper – issue: even if the "reproducibility crisis" is solved, we are still mostly dealing with point-wise quantitative metrics as the only benchmarks for RSs. As reported by Sun et al. [4], the dominating metrics used in the evaluation of recommender systems published at top-tier conferences (RecSys, SIGIR, CIKM) are standard information retrieval metrics, such as MRR, Recall, HITS, NDCG [5, 6, 7, 8, 9].

While it is undoubtedly convenient to summarize the performance of different models via one score, this lossy projection discards a lot of important information on model behavior: for example, given the power-law distribution in many real-world datasets ([10, 11, 12]), marginal improvements on frequent items may translate into noticeable accuracy gains, even at the cost of significantly degrading the experience of subgroups. Metrics such as coverage, serendipity, and bias [13, 14, 15] are a first step in the right direction, but they still fall short of capturing the full complexity of deploying RSs.

Following the pioneering work of [16] in Natural Language Processing, we propose to supplement standard retrieval metrics with new tests: in particular, we encourage practitioners to go beyond the false dichotomy "quantitative-and-automated" vs "qualitative-and-manual", and find a middle ground in which behavioral desiderata can be expressed transparently in code [1].
2.2. An Industrial Perspective

RSs in practice differ from RSs used in research in crucial ways. For example, in research, a static dataset is used repeatedly, and there is no real interactivity between the model and users: prediction over a given point in time x_t in the test set doesn't change what happens at x_{t+1}.2 Even without considering the complexity of reproducing real-world interactions for benchmarking purposes, we highlight four important themes from our experience in building RSs at scale in production scenarios:

• Cold-start performance: new/rare items and users are challenging for many models across industries [19, 20]. In e-commerce, for instance, while most "similar products" predictions will happen over frequent items, in reality new users and items can represent a big portion of them, with significant business consequences: the cold-start problem is believed to affect 50% of users [21], in a context where field studies found that 40% of shoppers would stop shopping if shown non-relevant recommendations [22].
• Use cases and industry idiosyncrasies: different use cases in different industries present different challenges. For instance, recommendations for complementary items in e-commerce need to account for the fact that if item A is a good complementary candidate for item B, the reverse might not hold (e.g. an HDMI cable is a good complementary item for a 4K TV, but not vice versa). Music recommendations need to deal with the issue of "hubness", where popular items act as hubs in the top-N recommendation list of many users without being similar to the users' profiles, making other items invisible to the recommender [23]. Such use-case specific traits are particularly important when designing effective testing procedures and often require considerable domain knowledge.
• Not all mistakes are equal: point-wise metrics are unable to distinguish different types of mistakes; this is especially problematic for recommender systems, as even a single mistake may cause great social and reputational damage [24].
• Robustness matters as much as accuracy: while historically a significant part of industry effort can be traced back to a few key players, there is a blooming market of Recommendation-as-a-Service systems designed to address the needs of "reasonable scale" systems [25]. Instead of vertical scaling and extreme optimization, SaaS providers emphasize horizontal scaling through multiple deployments, highlighting the importance of models that prove to be flexible and robust across many dimensions (e.g., traffic, industry, etc.).

While not related to model evaluation per se, decision-making processes in the real world would also take into account the different resources used by competing approaches: time (both as time for training and latency for serving), computing (CPU vs GPU), and CO2 emissions are all typically included in an industry benchmark.

2 This is especially important in the context of sequential recommenders [17], which arguably resemble more reinforcement learning than supervised inference with pseudo-feedback [18].

3. EvalRS Challenge

We propose to supplement standard retrieval metrics over held-out data points with behavioral tests: in behavioral tests, we treat the target model as a black-box and supply only input-output pairs (for example, query user and desired recommended song). In particular, we leverage a recent open-source package, RecList [1], to prepare a suite of tests for our target dataset (Section 3.1). In putting forward our tests, we operationalize the intuitions from Section 2 through a general plug-and-play API to facilitate model comparison and data preparation, and by providing convenient abstractions and ready-made recommenders used as baselines.
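To make the black-box setup concrete, the following is a minimal sketch of the idea: the test harness only sees a model's input-output pairs, never its internals. This is an illustration, not the actual RecList API; the class and function names (BlackBoxRecommender, run_test, the popularity baseline) are hypothetical.

```python
# Minimal sketch of black-box behavioral testing: the suite consumes only
# (input, output) pairs and never inspects model internals.
from abc import ABC, abstractmethod
from typing import Dict, List


class BlackBoxRecommender(ABC):
    """Anything that maps a user id to a ranked list of track ids."""

    @abstractmethod
    def predict(self, user_ids: List[str], k: int = 100) -> Dict[str, List[str]]:
        """Return the top-k recommended track ids for each user id."""


def run_test(model: BlackBoxRecommender,
             ground_truth: Dict[str, str],
             metric_fn,
             k: int = 100) -> float:
    """Feed inputs, collect outputs, score them with any metric function."""
    predictions = model.predict(list(ground_truth.keys()), k=k)
    return metric_fn(predictions, ground_truth)


class PopularityBaseline(BlackBoxRecommender):
    """Trivial baseline: recommend the same globally popular tracks to everyone."""

    def __init__(self, ranked_tracks: List[str]):
        self.ranked_tracks = ranked_tracks

    def predict(self, user_ids, k=100):
        return {u: self.ranked_tracks[:k] for u in user_ids}


def hit_rate(predictions, ground_truth):
    hits = sum(1 for u, t in ground_truth.items() if t in predictions.get(u, []))
    return hits / len(ground_truth)


if __name__ == "__main__":
    model = PopularityBaseline(ranked_tracks=["t_1", "t_2", "t_3"])
    print(run_test(model, {"user_a": "t_2", "user_b": "t_9"}, hit_rate))  # 0.5
```

Because every test is expressed against this narrow interface, the same suite can be applied to any participant model, from matrix factorization to large neural recommenders.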
3.1. Use Case and Dataset

EvalRS is a user-item recommendation challenge in the music domain: participants are asked to train a model that, given a user id, recommends an appropriate song out of a known set of songs. The ground truth necessary to compute all the test metrics, quantitative and behavioral, is provided by our leave-one-out framework: for each user, we remove a song from their listening history and use it as the ground truth when evaluating the models.

We provide test abstractions and an evaluation script designed for LFM, a transformed version of the LFM-1b dataset [2] – a dataset focused on music consumption on Last.fm. We chose the LFM-1b dataset as the primary data source, after a thorough comparison of popular datasets, for a unique combination of features. Given our focus on rounded evaluation and the importance of joining prediction / ground truth with meta-data, LFM is an ideal dataset, as it provides rich song (artist, album information) and user (country, age, gender,3 time on platform) meta-data.

We applied principled data transformations to make EvalRS amenable to a larger audience whilst preserving the rich information in the original dataset. We detail the data transformation process and our motivations:

• First, we removed users and artists which have few interactions, since they are likely to be too sparse to be informative. Following suggestions in the literature, we apply k-core [26] filtering to the bipartite interaction graph between users and artists, setting k = 10 (i.e. we retain vertices with a minimum degree of k).
• After the aforementioned processing, the dataset still contained over 900M events, which motivated further filtering of the data. In particular, we keep only the first interaction a user had with a given track, and for each user we retain only their N = 500 most recent unique track interactions. We supplement the information lost during this pruning step by providing the interaction count between a user and a track.
• We then performed another iteration of k-core filtering, this time on the user-track interaction graph, with k = 10, to retain only users and tracks which are informative.
• Lastly, the original dataset contained missing meta-data (e.g. there were track_ids in the events data which did not have corresponding track metadata). We removed tracks, albums, artists and events which had missing information.
• We summarize the final dataset statistics in Table 1.

Table 1
Descriptive statistics for the LFM dataset.

  Items                                       Value
  Users                                       119,555
  Artists                                     62,943
  Albums                                      1,374,121
  Tracks                                      820,998
  Listening Events                            37,926,429
  User-Track History Length (25/50/75 pct)    241/346/413

3 Gender in the original dataset is a binary variable. This is a limitation, as it gives a stereotyped representation of gender. Our intent is not to make normative claims about gender.
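The sketch below illustrates the transformation pipeline just described, assuming a pandas DataFrame of listening events with hypothetical columns user_id, artist_id, track_id and timestamp; it mirrors the described steps rather than reproducing the exact challenge script.

```python
# Sketch of the preprocessing steps: k-core filtering on the user-artist
# graph, per-pair deduplication with interaction counts, per-user truncation
# to the 500 most recent tracks, and a second k-core pass on user-track.
import pandas as pd


def k_core_filter(events: pd.DataFrame, left: str, right: str, k: int = 10) -> pd.DataFrame:
    """Iteratively drop rows until every node on both sides of the bipartite
    graph (e.g. users and artists) has degree >= k."""
    while True:
        left_deg = events.groupby(left)[right].nunique()
        right_deg = events.groupby(right)[left].nunique()
        keep = events[left].map(left_deg).ge(k) & events[right].map(right_deg).ge(k)
        if keep.all():
            return events
        events = events[keep]


def preprocess(events: pd.DataFrame, n_recent: int = 500) -> pd.DataFrame:
    # 1. k-core on the user-artist bipartite graph (k=10).
    events = k_core_filter(events, "user_id", "artist_id", k=10)
    # 2. Keep only the first interaction of each (user, track) pair,
    #    while recording how many times the pair occurred.
    counts = (
        events.groupby(["user_id", "track_id"])
        .size()
        .rename("interaction_count")
        .reset_index()
    )
    events = (
        events.sort_values("timestamp")
        .drop_duplicates(["user_id", "track_id"], keep="first")
        .merge(counts, on=["user_id", "track_id"], how="left")
    )
    # 3. Retain the N=500 most recent unique tracks per user.
    events = (
        events.sort_values("timestamp")
        .groupby("user_id", group_keys=False)
        .tail(n_recent)
    )
    # 4. Second k-core pass, this time on the user-track graph (k=10).
    return k_core_filter(events, "user_id", "track_id", k=10)
```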
Taken together, these features allow us to fulfill EvalRS' promise of offering a challenging setting and a rounded evaluation. While a clear motivation behind the release of the LFM-1b dataset was to offer "additional user descriptors that reflect their music taste and consumption behavior", it is telling that both the modelling and the evaluation by the original authors are still performed without any real use of these rich meta-data [27]. By taking a fresh look at an existing, popular dataset, EvalRS challenges practitioners to think about models not just along familiar quantitative dimensions, but also along non-standard scores closer to human perception of relevance and fairness.

3.2. Evaluation Metrics

Submissions are evaluated according to our randomized loop (Section 3.3) over the testing suite released with the challenge. At a first glance, tests can be roughly divided into three main groups:

• Standard RSs metrics: these are the typical point-wise metrics used in the field (e.g. MRR, HR@K) – they are included as sanity checks and as an informative baseline against which insights gained through the behavioral tests can be interpreted.
• Standard metrics on a per-group or slice basis: as shown for example in [1], models which are indistinguishable on the full test set may exhibit very different behavior across data slices. It is therefore crucial to quantify model performance for specific input and target groups: is there a performance difference between males and females? Is there an accuracy drop when artists are not very popular?
• Behavioral tests: this group may include perturbation tests (i.e. if we modify a user's history by swapping Metallica with Pantera, how much will predictions change?), and error distance tests (i.e. if the ground truth is Shine On You Crazy Diamond and the prediction is Smoke on the Water, how severe is this error?).

Based on this taxonomy, we now survey the tests implemented in the RecList powering EvalRS, with reference to relevant literature and examples from the target dataset. For implementation details please refer to the official repository.4

3.2.1. Standard RSs metrics

Based on popular metrics in the literature, we picked two standard metrics as a quantitative baseline and sanity check for our RecList:

• Mean Reciprocal Rank (MRR), as a measure of where the first relevant element retrieved by the model is ranked in the output list. Besides being considered a standard rank-aware evaluation metric, we chose MRR because it is particularly simple to compute and to interpret.
• Hit Rate (HR), defined as Recall at k (k = 100), i.e. the proportion of relevant items found in the top-k recommendations.

4 https://github.com/RecList/evalRS-CIKM-2022.
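As a reference, here is a minimal sketch of the two point-wise metrics under our leave-one-out setup, where each user has a ranked list of predicted track ids and a single held-out ground-truth track; this is an illustration, not the official scoring code.

```python
# Point-wise sanity-check metrics: MRR and Hit Rate (Recall@k with a single
# relevant item per user, as in the leave-one-out protocol).
from typing import Dict, List


def mrr(predictions: Dict[str, List[str]], ground_truth: Dict[str, str]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the target item (0 if missed)."""
    total = 0.0
    for user, target in ground_truth.items():
        ranked = predictions.get(user, [])
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(ground_truth)


def hit_rate_at_k(predictions: Dict[str, List[str]],
                  ground_truth: Dict[str, str],
                  k: int = 100) -> float:
    """Fraction of users whose target track appears in the top-k recommendations."""
    hits = sum(
        1 for user, target in ground_truth.items()
        if target in predictions.get(user, [])[:k]
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    preds = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
    truth = {"u1": "b", "u2": "z"}
    print(mrr(preds, truth))            # (1/2 + 0) / 2 = 0.25
    print(hit_rate_at_k(preds, truth))  # 1 / 2 = 0.5
```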
3.2.2. Standard metrics on a per-group or slice basis

Models are tested to address a wide spectrum of known issues for recommender systems, for instance: fairness (e.g. a model should have equal outcomes for different groups, e.g. [28, 29, 30]), robustness (e.g. a model should produce good outcomes also for long-tail items, such as items with less history or belonging to less represented categories, e.g. [31]), and industry-specific use cases (e.g. in the case of music, a model should not consistently penalize niche or simply less known artists).

All the tests in this group are based on the Miss Rate (MR), defined as the ratio between the prediction errors (i.e. model predictions that do not contain the ground truth) and the number of predictions. Slices can be generalized as n partitions of the test data forming n-ary classes (e.g. the Country slice, with UK/US/IT/FR and others, is split into n partitions). The absolute difference between the MR obtained on each slice and the MR obtained on the original test set is averaged and negated (so that a higher value implies better performance on the metric) to obtain the final score for each test. The slice-based tests considered for the final scores are:

• Gender balance. This test is meant to address fairness towards gender [32]. Since the dataset only provides binary gender, the test will minimize the difference between the MR obtained on users who specified Female as gender and the MR obtained on the original test set. In other words, the smaller the difference, the fairer the model towards potential gender biases.
• Artist popularity. This test is meant to address a known problem in music recommendations: niche (or simply less known) artists, and users who are less interested in highly popular content, are often penalized by recommender systems [33, 34]. This point appears even more important when we consider that several music streaming services (e.g. Spotify, Tidal) also act as marketplaces for artists to promote their music. Splitting the test set in two would draw an arbitrary line between popular vs. unpopular artists, failing to capture the actual properties of the distribution; instead, we split the test set into bins of equal size after logarithmic scaling.
• User country. Music consumption is subject to many country-dependent factors, such as language differences, local sub-genres and styles, local licensing and distribution laws, cultural influences of local traditional music, etc. [35]. We capture this diversity by slicing the test set based on the top-10 countries by user counts.
• Song popularity. This test measures the model performance on both popular tracks and on songs with fewer listening events. The test is designed to address both robustness to long-tail items and cold-start scenarios, so we pooled together both less popular and newer songs. Again, we used logarithmic bucketing with base 10 to divide the test set, in order to avoid arbitrary thresholds.
• User history. This test can be viewed as a robustness/cold-start test, in which we sliced the dataset based on the length of user history on the platform. To create slices, we use the user play counts (i.e. the sum of play counts per user) and again apply logarithmic bucketing in base 10 to divide the test set, avoiding arbitrary thresholds.
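The following sketch illustrates the slice-based scoring just described – per-slice Miss Rate compared against the overall Miss Rate, with log-based bucketing for popularity slices. Column names and the bucketing helper are assumptions for illustration; the official implementation in the challenge repository may differ in details (e.g. how individual slice deltas are aggregated for a given test).

```python
# Slice-based scoring: compute the Miss Rate (MR) on the full test set and on
# each slice, average the absolute differences, and negate the result so that
# a higher value (closer to 0) is better.
import numpy as np
import pandas as pd


def miss_rate(df: pd.DataFrame) -> float:
    """Fraction of rows whose top-k predictions do not contain the target."""
    return 1.0 - df["hit"].mean()   # "hit" is 1 if the target was in the top-k


def sliced_score(df: pd.DataFrame, slice_col: str) -> float:
    overall_mr = miss_rate(df)
    deltas = [abs(miss_rate(group) - overall_mr) for _, group in df.groupby(slice_col)]
    return -float(np.mean(deltas))  # negated: 0 is the best possible value


def log_buckets(counts: pd.Series, base: int = 10) -> pd.Series:
    """Popularity slices via logarithmic bucketing (avoids arbitrary thresholds)."""
    return np.floor(np.log(counts.clip(lower=1)) / np.log(base)).astype(int)


if __name__ == "__main__":
    test_df = pd.DataFrame({
        "user_id": ["u1", "u2", "u3", "u4"],
        "hit": [1, 0, 1, 1],
        "gender": ["f", "f", "m", "m"],
        "track_play_count": [3, 12, 150, 7],
    })
    test_df["popularity_bin"] = log_buckets(test_df["track_play_count"])
    print(sliced_score(test_df, "gender"))          # -0.25 on this toy data
    print(sliced_score(test_df, "popularity_bin"))
```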
3.2.3. Behavioral and qualitative tests

Our final set of tests is behavioral in nature, and tries to capture (with some assumptions) how models differ based on qualitative aspects:

• Be less wrong. It is important that RSs maintain a reasonable standard of relevance even when the predictions are not accurate. For instance, if the ground truth for a recommendation is the rap song 'Humble' by Kendrick Lamar, a model might suggest another rap song from the same year ('The Story of O.J.' by Jay-Z), or a famous pop song from the top chart of that year ('Shape of You' by Ed Sheeran). There is still a substantial difference between these two, as the first one is closer to the ground truth than the second. Since this has a great impact on the overall user experience, it is desirable to test and measure model performance in scenarios like the one just described. We use the latent space of tracks to compute the average pairwise cosine distance between the embeddings of the predicted items and the ground truths.
• Latent diversity. Diversity is closely tied with the maximization of marginal relevance as a way to acknowledge uncertainty of user intent and to address user utility in terms of discovery [36]. Diversity is often considered a partial proxy for fairness, and it is an important measure of the performance of recommender systems in real-world scenarios [37]. We address diversity using the latent space of tracks, testing for model density – where density is defined as the summation of the differences between each point in the prediction space and the mean of the prediction space. Additionally, in order to account also for the "correctness" of prediction vectors, we calculate a bias, defined as the distance between the ground truth vector and the mean of the prediction vectors, and weight it to penalize for high bias: the final score is computed as 0.3 * diversity - 0.7 * bias, where 0.3 and 0.7 are weights that we determined empirically to balance diversity and correctness.

Please note that, since we aim at widening the community contribution to testing, the final code submission for EvalRS includes as a requirement that participants contribute at least one custom test, by extending the provided abstraction.
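As an illustration of these two behavioral scores, the sketch below operates on pre-computed track embeddings. It is a sketch only: the exact aggregation (sum vs. mean of differences), normalization, and distance choices of the official implementation may differ; 0.3 and 0.7 are the empirical weights reported above.

```python
# Behavioral scores in the track latent space: "be less wrong" (how close are
# the misses to the ground truth?) and "latent diversity" (spread of the
# predictions, penalized by their distance from the ground truth).
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def be_less_wrong(pred_embeddings: np.ndarray, truth_embedding: np.ndarray) -> float:
    """Average cosine distance between each predicted track and the ground
    truth: lower means the errors stay 'closer' to the right answer."""
    return float(np.mean([cosine_distance(p, truth_embedding) for p in pred_embeddings]))


def latent_diversity(pred_embeddings: np.ndarray, truth_embedding: np.ndarray,
                     w_div: float = 0.3, w_bias: float = 0.7) -> float:
    """Score = 0.3 * diversity - 0.7 * bias, where diversity is the spread of
    the predictions around their mean and bias is the distance between that
    mean and the ground truth."""
    center = pred_embeddings.mean(axis=0)
    diversity = float(np.mean(np.linalg.norm(pred_embeddings - center, axis=1)))
    bias = float(np.linalg.norm(truth_embedding - center))
    return w_div * diversity - w_bias * bias


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds = rng.normal(size=(10, 32))   # embeddings of 10 recommended tracks
    truth = rng.normal(size=32)         # embedding of the held-out track
    print(be_less_wrong(preds, truth))
    print(latent_diversity(preds, truth))
```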
3.2.4. Final score

Since each of the tests above returns a score from a potentially unique, non-normal distribution, we need a way to define a macro-score for the leaderboard. To define the formula we adopt an empirical approach in two phases:

1. First phase: scores of individual tests are simply averaged to get the leaderboard macro-score. The purpose of this phase is to gather data on the relative difficulty and utility of the different tests, and to get participants comfortable, through harmless iterations, with the dataset and the multi-faceted nature of the challenge.
2. Second phase: after the organizers have evaluated the score distributions for individual tests, they will attach different weights to each test to produce a balanced macro-score – i.e. if a test turns out to be easy for most participants, its importance will be counter-biased in the calculation. At the beginning of this phase, participants are asked to update their evaluation script by cloning again the data challenge repository: the purpose for each team now becomes leveraging the insights from the previous phase to optimize their models as much as possible for the leaderboard. Only scores obtained in this phase are considered for the final prizes.

3.3. Methodology

Since the focus of the challenge is a popular public dataset, we implemented a robust evaluation procedure to avoid data leakage and ensure fairness.5 Our protocol is split into two phases: local – when teams iterate on their solution during the challenge – and remote – when organizers verify the submissions at the end and proclaim the winners:

• Local evaluation protocol: For each fold, the provided script first samples 25% of the users in the dataset. It then partitions the dataset into training and testing sets using the leave-one-out protocol: the testing set comprises a list of unique users, where the target song for each of them has been picked randomly from their history. The training set is the listening history for these sampled users with their test song removed. Participants' models will be trained and tuned based on their custom logic on the training set, and then evaluated over the test suite (Section 3.2) to provide a final score for each run (Section 3.2.4); partitioning, training, testing and scoring will be done for a total of 4 repetitions: the average of the runs will constitute the leaderboard score.
• Remote evaluation protocol: the organizers will run the code submitted by participants, and repeat the random evaluation loop. The scores thus obtained on the EvalRS test suite will be compared with participants' submissions as a sanity check (statistical comparison of means and 95% bootstrapped CI).

Thanks to the provided APIs, participants will be able to run the full evaluation loop locally, as well as update their leaderboard score automatically through the provided script. To ensure a fair and reproducible remote evaluation, the final submission should contain a docker image that runs the local evaluation script and produces the desired output within the maximum allotted time on the target cloud machine. Please check the EvalRS repository for the exact final requirements and up-to-date instructions.

5 To help participants with the implementation, we provide a template script that can be modified with custom model code.
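The local protocol can be summarized with the following sketch – sample 25% of users, hold out one random track per user, train on the remainder, score with the test suite, and average over four repetitions. Function names (train_fn, evaluate_suite) and column names are placeholders, not the challenge API.

```python
# Sketch of the randomized leave-one-out evaluation loop described above.
import numpy as np
import pandas as pd


def leave_one_out_split(events: pd.DataFrame, user_frac: float = 0.25, seed: int = 0):
    rng = np.random.default_rng(seed)
    users = events["user_id"].unique()
    sampled = rng.choice(users, size=int(len(users) * user_frac), replace=False)
    sample = events[events["user_id"].isin(sampled)]
    # One random event per sampled user becomes the held-out test target.
    test = sample.groupby("user_id", group_keys=False).sample(n=1, random_state=seed)
    train = sample.drop(test.index)
    return train, test


def run_evaluation(events: pd.DataFrame, train_fn, evaluate_suite, n_folds: int = 4) -> float:
    """Average the macro-score over n_folds independent leave-one-out runs."""
    scores = []
    for fold in range(n_folds):
        train, test = leave_one_out_split(events, seed=fold)
        model = train_fn(train)                     # participant's custom training logic
        scores.append(evaluate_suite(model, test))  # full test suite -> macro-score
    return float(np.mean(scores))
```

Because partitioning is re-randomized at every repetition (and again during the remote verification run), overfitting a single fixed split offers little advantage on the leaderboard.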
4. Organization, Community, Impact

4.1. Structure and timeline

EvalRS unfolds in three main phases:

1. CHALLENGE: An open challenge phase, where participating teams register for the challenge and work on improving the scores on both standard and behavioral metrics across the two phases explained above (3.2.4).
2. CFP: A call for papers, where teams submit a written contribution describing their system, custom testing, and data insights.
3. CONFERENCE: At the conference, winners will be announced and special prizes for novel testing and outstanding student work will be awarded. During the workshop, we plan to discuss solicited papers and host a round-table with experts on RSs evaluation.

Our CFP takes a "design paper" perspective, where teams are invited to discuss both how they adapted their initial model to take into account the test suite, and how the tests strengthened their understanding of the target dataset and use case.6

We emphasize the CFP and CONFERENCE steps as moments to share with the community additional tests, error analysis and data insights inspired by EvalRS. By leveraging RecList, we not only enable teams to quickly iterate starting from our ideas, but we promise to immediately circulate in the community their testing contributions through a popular open source package. Finally, we plan on using CEUR-WS to publish the accepted papers, as well as drafting a final public report as an additional, actionable artifact from the challenge.

6 As customary in these events, we will involve a small committee of top-tier practitioners and scholars to ensure the quality of the final submissions.

4.2. Organizers

Jacopo Tagliabue. Jacopo Tagliabue was co-founder of Tooso, an Information Retrieval company acquired by Coveo in 2019. As Director of AI at Coveo, he divides his time between product, research, and evangelization: he is Adj. Professor of MLSys at NYU, publishes regularly in top-tier conferences (including NAACL, ACL, RecSys, SIGIR), and is co-organizer of SIGIR eCom. Jacopo was the lead organizer of the SIGIR Data Challenge 2021, spearheading the release of the largest session-based dataset for eCommerce research.

Federico Bianchi. Federico Bianchi is a postdoctoral researcher at Stanford University. He obtained his Ph.D. in Computer Science at the University of Milano-Bicocca in 2020. His research, ranging from Natural Language Processing methods for textual analytics to recommender systems for e-commerce, has been accepted to major NLP and AI conferences (EACL, NAACL, EMNLP, ACL, AAAI, RecSys) and journals (Cognitive Science, Applied Intelligence, Semantic Web Journal). He co-organized the SIGIR Data Challenge 2021. He frequently releases his research as open-source tools that have collected almost a thousand GitHub stars and been downloaded over 100 thousand times.

Tobias Schnabel. Tobias Schnabel is a senior researcher in the Productivity+Intelligence group at Microsoft Research. He is interested in improving human-facing machine learning systems in an integrated way, considering not only algorithmic but also human factors. To this end, his research draws from causal inference, reinforcement learning, machine learning, HCI, and decision-making under uncertainty. He was a co-organizer for a WSDM workshop this year and has served as (senior) PC member for a wide array of AI and data science conferences (ICML, NeurIPS, WSDM, KDD). Before joining Microsoft, he obtained his Ph.D. from the Computer Science Department at Cornell University under Thorsten Joachims.

Giuseppe Attanasio. Giuseppe Attanasio is a postdoctoral researcher at Bocconi, where he works on large-scale neural architectures for Natural Language Processing. His research focuses on understanding and regularizing models for debiasing and fairness purposes. His research on the topic has been accepted to major NLP conferences (ACL). While working at Bocconi, he is concluding his Ph.D. at the Department of Control and Computer Engineering at Politecnico di Torino.

Ciro Greco. Ciro Greco was the co-founder and CEO of Tooso, a San Francisco based startup specialized in Information Retrieval. Tooso was acquired in 2019 by Coveo, where he now works as VP of Artificial Intelligence. He holds a Ph.D. in Linguistics and Cognitive Neuroscience from Milano-Bicocca. He worked as a visiting scholar at MIT and as a post-doctoral fellow at Ghent University. He has published extensively in top-tier conferences (including NAACL, ACL, RecSys, SIGIR) and scientific journals (The Linguistic Review, Cognitive Science, Nature Communications). He was also a co-organizer of the SIGIR Data Challenge 2021.

Gabriel de Souza P. Moreira. Gabriel Moreira is a Sr. Applied Research Scientist at NVIDIA, leading the research efforts of the Merlin research team. He received his Ph.D. degree from ITA university, Brazil, with a focus on Deep Learning for RecSys and session-based recommendation. Before joining NVIDIA, he was lead Data Scientist at CI&T for 5 years, after working as a software engineer for more than a decade. In 2019, he was recognized as a Google Developer Expert (GDE) for Machine Learning. He was part of the NVIDIA teams that won recent RecSys competitions: the ACM RecSys Challenge 2020, the WSDM WebTour Workshop Challenge 2021 by Booking.com and the SIGIR eCommerce Workshop Data Challenge 2021 by Coveo.

Patrick John Chia. Patrick John Chia is an Applied Scientist at Coveo. Prior to this, he completed his Master's degree at Imperial College London and spent a year at the Massachusetts Institute of Technology (MIT). He was a co-organizer of the 2021 SIGIR Data Challenge and has been a speaker on topics at the intersection of Machine Learning and eCommerce (SIGIR eCom, ECNLP at ACL). His latest interests lie in developing AI that has the ability to learn like infants and applying it to creating solutions at Coveo.
5. Similar Events and Broader Outlook

The CIKM-related community has shown great interest in themes at the intersection of aligning machine learning with human judgment, rigorous evaluation settings, and fairness, as witnessed by popular Data Challenges and important workshops in top-tier venues. Among recent challenges, the 2021 SIGIR-Ecom Data Challenge, the 2021 Booking Data Challenge, and the 2020 RecSys Challenge are all events centered around the evaluation of RSs, yet still substantially different: for example, the SIGIR Challenge focused on MRR as a success metric [10], while the Booking Challenge [38] used top-k accuracy. Moreover, the growing interest in rounded evaluation led to the creation of many interesting workshops in recent years, such as IntRS: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, ImpactRS: Workshop on the Impact of Recommender Systems, and FAccTRec: Workshop on Responsible Recommendation.

For this reason, we expect this challenge to attract a diverse set of practitioners: first, researchers interested in the evaluation of RSs and fairness; second, researchers who proposed a new model and desire to test its generalization abilities on new metrics; third, industrial practitioners that started using RecList after its release in recent months, and already signaled strong support for behavioral testing in their real-world use cases.

EvalRS makes a novel and significant contribution to the community: first, we ask practitioners to "live and breathe" the problem of evaluation, operationalizing principles and insights through sharable code; second, we embrace a "build in the open" approach, as all artifacts from the event will be available to the community as a permanent contribution, in the form of open source code, design papers, and public documentation. Through prizes assigned based on scores, but also on outstanding testing and paper contributions, and special awards for students, we hope to actively encourage more practitioners to join the evaluation debate and get a more diverse set of perspectives for our workshop.

As argued throughout this paper, when comparing the EvalRS methodology to typical data challenges, we can summarize three important differentiating factors: first, we fight public leaderboard overfitting through our randomized evaluation loop; second, we discourage complex solutions that cannot be practically used, as our open source code competition provides a fixed (and reasonable) compute budget; third and most importantly, with a thorough evaluation with per-group and behavioral tests, we encourage participants to seek non-standard performance and discuss fairness implications.

We strongly believe these points will lay down the foundation for a first-of-its-kind automatic, shared, identifiable evaluation standard for RSs.

6. Acknowledgements

RecList is an open source library whose development is supported by forward-looking companies in the machine learning community: the organizers wish to thank Comet, Neptune, and Gantry for their generous support.7

7 Please check the project website for more details: https://reclist.io/.

References

[1] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, CoRR abs/2111.09963 (2021). URL: https://arxiv.org/abs/2111.09963. arXiv:2111.09963.
[2] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 103–110. URL: https://doi.org/10.1145/2911996.2912004. doi:10.1145/2911996.2912004.
[3] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 101–109. URL: https://doi.org/10.1145/3298689.3347058. doi:10.1145/3298689.3347058.
[4] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 23–32.
[5] X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neural graph collaborative filtering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 165–174.
[6] A. Rashed, S. Jawed, L. Schmidt-Thieme, A. Hintsches, MultiRec: A multi-relational approach for unique item recommendation in auction systems, in: Fourteenth ACM Conference on Recommender Systems, 2020.
[7] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Liberty, K. Al Jadda, From the lab to production: A case study of session-based recommendations in the home-improvement domain, in: Fourteenth ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 140–149. URL: https://doi.org/10.1145/3383313.3412235. doi:10.1145/3383313.3412235.
[8] T. Moins, D. Aloise, S. J. Blanchard, RecSeats: A hybrid convolutional neural network choice model for seat recommendations at reserved seating venues, in: Fourteenth ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 309–317. URL: https://doi.org/10.1145/3383313.3412263. doi:10.1145/3383313.3412263.
[9] F. Bianchi, J. Tagliabue, B. Yu, Query2Prod2Vec: Grounded word embeddings for eCommerce, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, Association for Computational Linguistics, Online, 2021, pp. 154–162. URL: https://aclanthology.org/2021.naacl-industry.20. doi:10.18653/v1/2021.naacl-industry.20.
[10] J. Tagliabue, C. Greco, J.-F. Roy, F. Bianchi, G. Cassani, B. Yu, P. J. Chia, SIGIR 2021 e-commerce workshop data challenge, in: SIGIR eCom 2021, 2021.
[11] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[12] H. Zamani, M. Schedl, P. Lamere, C.-W. Chen, An analysis of approaches taken in the ACM RecSys Challenge 2018 for automatic music playlist continuation, ACM Trans. Intell. Syst. Technol. 10 (2019). URL: https://doi.org/10.1145/3344257. doi:10.1145/3344257.
[13] D. Kotkov, J. Veijalainen, S. Wang, Challenges of serendipity in recommender systems, in: WEBIST, 2016.
[14] D. Jannach, M. Ludewig, When recurrent neural networks meet the neighborhood for session-based recommendation, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 306–310.
[15] M. Ludewig, D. Jannach, Evaluation of session-based recommendation algorithms, User Modeling and User-Adapted Interaction 28 (2018) 331–390.
[16] M. T. Ribeiro, T. S. Wu, C. Guestrin, S. Singh, Beyond accuracy: Behavioral testing of NLP models with CheckList, in: ACL, 2020.
[17] G. d. S. P. Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation, in: Fifteenth ACM Conference on Recommender Systems, 2021, pp. 143–153.
[18] K. Ariu, N. Ryu, S. Yun, A. Proutière, Regret in online recommendation systems, ArXiv abs/2010.12363 (2020).
[19] J. Tagliabue, B. Yu, F. Bianchi, The Embeddings That Came in From the Cold: Improving Vectors for New and Rare Products with Content-Based Inference, Association for Computing Machinery, New York, NY, USA, 2020, pp. 577–578. URL: https://doi.org/10.1145/3383313.3411477.
[20] L. Briand, G. Salha-Galvan, W. Bendada, M. Morlon, V.-A. Tran, A semi-personalized system for user cold start recommendation on music streaming apps, 2020. URL: arXiv:2106.03819.
[21] M. Hendriksen, E. Kuiper, P. Nauts, S. Schelter, M. de Rijke, Analyzing and predicting purchase intent in e-commerce: Anonymous vs. identified customers, 2020. URL: https://arxiv.org/abs/2012.08777.
[22] Krista Garcia, The impact of product recommendations, 2018. URL: https://www.emarketer.com/content/the-impact-of-product-recommendations.
[23] A. Flexer, D. Schnitzer, J. Schlueter, A MIREX meta-analysis of hubness in audio music similarity, 2012.
[24] M. Twohey, G. J. Dance, Lawmakers press Amazon on sales of chemical used in suicides, 2022. URL: https://www.nytimes.com/2022/02/04/technology/amazon-suicide-poison-preservative.html.
[25] J. Tagliabue, You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack, Association for Computing Machinery, New York, NY, USA, 2021, pp. 598–600. URL: https://doi.org/10.1145/3460231.3474604.
[26] V. Batagelj, M. Zaveršnik, Generalized cores, Advances in Data Analysis and Classification 5 (2011) 129–145.
[27] M. Schedl, Investigating country-specific music preferences and music recommendation algorithms with the LFM-1b dataset, International Journal of Multimedia Information Retrieval 6 (2017) 71–84.
[28] K. Yang, J. Stoyanovich, Measuring fairness in ranked outputs, in: SSDBM 2017: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, 2017, pp. 1–6. URL: https://doi.org/10.1145/3085504.3085526.
[29] C. Castillo, Fairness and transparency in ranking, in: ACM SIGIR Forum, volume 52, 2019, pp. 64–71. URL: https://doi.org/10.1145/3308774.3308783.
[30] M. Zehlike, K. Yang, J. Stoyanovich, Fairness in ranking: A survey, 2020, pp. 1–58. URL: https://arxiv.org/pdf/2103.14000.pdf.
[31] M. O'Mahony, N. Hurley, N. Kushmerick, G. Silvestre, Collaborative recommendation: A robustness analysis, volume 4, 2004. URL: https://doi.org/10.1145/1031114.1031116.
[32] S. Saxena, S. Jain, Exploring and mitigating gender bias in recommender systems with explicit feedback, 2021. URL: arXiv preprint arXiv:2112.02530.
[33] D. Kowald, M. Schedl, E. Lex, The unfairness of popularity bias in music recommendation: A reproducibility study, European Conference on Information Retrieval (2020).
[34] Ò. Celma, P. Cano, From hits to niches? Or how popular artists can bias music recommendation and discovery, in: Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, 2008. URL: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.168.5009&rep=rep1&type=pdf.
[35] P. Bello, D. Garcia, Cultural divergence in popular music: the increasing diversity of music consumption on Spotify across countries, Humanities and Social Sciences Communications 8 (2021).
[36] M. Drosou, H. Jagadish, E. Pitoura, J. Stoyanovich, Diversity in big data: A review, Big Data 5.2 (2017) 73–84.
[37] Diversity in recommender systems – a survey, Knowledge-Based Systems (2017) 154–162.
[38] M. Baigorria Alonso, Data augmentation using many-to-many RNNs for session-aware recommender systems, in: ACM WSDM Workshop on Web Tourism (WSDM WebTour'21), 2021.