1. Introduction

EvalRS: a rounded evaluation of recom mender systems

Jacopo Tagliabue

tagliabue.jacopo@gmail.com 1

Federico Bianchi

Tobias Schnabel

Giuseppe Attanasio

Ciro Greco

Gabriel de Souza P. Moreira

Patrick John Chia

2 0 Bocconi University , Milan , Italy 1 Coveo Labs , New York, NY , USA 2 Coveo , Montreal , Canada 3 Microsoft , Redmond, WA , USA 4 NVIDIA , São Paulo , Brazil 5 Stanford University , Stanford, CA , USA

Much of the complexity of recommender systems (RSs) comes from the fact that they are used as part of highly diverse real-world applications which requires them to deal with a wide array of user needs. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as the reinforcement of data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs “in the wild”.

recommender systems behavioral testing open source

1. Introduction

Recommender systems (RSs) are embedded in most applications we use today. From streaming services to online retailers, the accuracy of a RS is a key factor in the success of many products. Evaluation of RSs has often been done considering point-wise metrics, such as HitRate (HR) or nDCG over held-out data points, but the field has recently begun to recognize the importance of a more rounded evaluation as a better proxy to real-world performance [ 1 ].

We designed E v a l R S as a new type of data challenge in which participants are asked to test their models incorporating quantitative as well as behavioral insights. Using a popular open dataset – Last.fm – we go beyond single aggregate numbers and instead require participants to optimize for a wide range of recommender systems properties. The contribution of this challenge is two-fold: EvalRS 2022: CIKM EvalRS 2022 DataChallenge, October 21, 2022, FB towards a first draft. PC led the implementation and contributed most of the RecList code. GA, CG, FB and PC researched, iterated and operationalized behavioral tests. GM reviewed the API and participants. Everybody helped with drafting the paper, rules and guidelines. JT and FB acted as senior PIs in the project. JT and CG 2.

Motivation E v a l R S at CIKM 2022 complements the existing challenge

landscape and it is driven by two diferent perspectives: the first one coming from academic research, the second one from the industrial development of RSs. We

1https://github.com/RecList/evalRS-CIKM-2022. 2.1. A Research Perspective Although undeniable progress was made in the past years,

concerns have been raised about the status of research advancements in the field of recommendations, particularly with respect to ephemeral processes in motivating architectural choices and lack of reproducibility [3]. This challenge draws attention to a further – and potentially deeper – issue: even if the “reproducibility crisis” is solved, we are still mostly dealing with point-wise quantitative metrics as the only benchmarks for RSs. As reported by Sun et al. [4], the dominating metrics used in the evaluation of recommender systems published at top-tier conferences (RecSys, SIGIR, CIKM) are standard information retrieval metrics, such as MRR, Recall, HITS, NDCG [5, 6, 7, 8, 9].

While it is undoubtedly convenient to summarize the performance of diferent models via one score, this lossy projection discards a lot of important information on model behavior: for example, given the power-law distribution in many real-world datasets ([10, 11, 12]), marginal improvements on frequent items may translate in noticeable accuracy gains, even at the cost of significantly degrading the experience of subgroups. Metrics such as coverage, serendipity, and bias [13, 14, 15] are a ifrst step in the right direction, but they still fall short of capturing the full complexity of deploying RSs.

Following the pioneering work of [16] in Natural Language Processing, we propose to supplement standard retrieval metrics with new tests: in particular, we encourage practitioners to go beyond the false dichotomy “quantitative-and-automated” vs “qualitativeand-manual”, and find a middle ground in which behavioral desiderata can be expressed transparently in code [ 1 ].

2.2. An Industrial Perspective RSs in practice difer from RSs used in research in crucial

ways. For example, in research, a static dataset is used repeatedly, and there is no real interactivity between the model and users: prediction over a given point in time in the test set doesn’t change what happens at +1 2. Even without considering the complexity of reproducing real-world interactions for benchmarking purposes, we highlight four important themes from our experience in building RSs at scale in production scenarios: • Cold-start performance: new/rare items and users are challenging for many models across industries [19, 20]. In e-commerce, for instance, while most “similar products” predictions will happen 2This is especially important in the context of sequential recommender[17], which arguably resembles more reinforcement learning than supervised inference with pseudo-feedback [18]. over frequent items, in reality, new users and items can represent a big portion of them with significant business consequences: the cold-start problem is believed to afect 50% of users [ 21] in a context where field studies found that 40% of shoppers would stop shopping if shown nonrelevant recommendations [22]. • Use cases and industry idiosyncrasies: diferent use cases in diferent industries present diferent challenges. For instance, recommendations for complementary items in e-commerce need to account for the fact that if item A is a good complementary candidate for item B, the reverse might not hold (e.g. an HDMI cable is a good complementary item for a 4k TV, but not vice versa). Music recommendations need to deal with the issue of “hubness”, where popular items act as hubs in the top-N recommendation list of many users without being similar to the users’ profiles and making other items invisible to the recommender [23]. Such use-case specific traits are particularly important when designing efective testing procedures and often require considerable domain knowledge. • Not all mistakes are equal: point-wise metrics are unable to distinguish diferent types of mistakes; this is especially problematic for recommender systems, as even a single mistake may cause great social and reputational damage [24]. • Robustness matters as much as accuracy: while historically a significant part of industry efort can be traced back to a few key players, there is a blooming market of Recommendation-as-a-Service systems designed to address the needs of “reasonable scale” systems [25]. Instead of vertical scaling and extreme optimization, SaaS providers emphasize horizontal scaling through multiple deployments, highlighting the importance of models that prove to be flexible and robust across many dimensions (e.g., trafic, industry, etc.).

While not related to model evaluation per se, decisionmaking processes in the real world would also take into account the diferent resources used by competing approaches: time (both as time for training and latency for serving), computing (CPU vs GPU), CO2 emissions are all typically included in an industry benchmark.

3. EvalRS Challenge

We propose to supplement standard retrieval metrics over held out data points with behavioral tests: in behavioral tests, we treat the target model as a black-box and supply only input-output pairs (for example, query user and desired recommended song). In particular, we leverage a recent open-source package, R e c L i s t [ 1 ], to prepare a suite of tests for our target dataset (Section 3.1). In putting forward our tests, we operationalize the intuitions from Section 2 through a general plug-andplay API to facilitate model comparison and data preparation, and by providing convenient abstractions and ready-made recommenders used as baselines.

3.1. Use Case and Dataset

E v a l R S is a user-item recommendation challenge in the music domain: participants are asked to train a model • We then performed another iteration of k-core that, given a user id, recommends an appropriate song out ifltering, this time on the u s e r - t r a c k interaction of a known set of songs. The ground truth necessary to graph, with = 10 to retain only users and tracks compute all the test metrics, quantitative and behavioral, which are informative. is provided by our leave-one-out framework: for each • Lastly, the original dataset contained missing user, we remove a song from their listening history and meta-data (e.g. there were t r a c k _ i d in the events use it as the ground truth when evaluating the models. data which did not have corresponding track

We provide test abstractions and an evaluation script metadata). We removed tracks, albums, artists designed for LFM, a transformed version of LFM-1b and events which had missing information. dataset [ 2 ] – a dataset focused on music consumption on • We summarize the final dataset statistics in Table Last.fm. We chose the LFM-1b dataset as the primary data 1. source after a thorough comparisons of popular datasets for a unique combination of features. Given our focus Taken together, these features allow us to fulfill E v a l R S on rounded evaluation and the importance of joining promise of ofering a challenging setting and a rounded prediction / ground truth with meta-data, LFM is an ideal evaluation. While a clear motivation behind the release of dataset, as it provides rich song (artist, album informa- LFM-1b dataset was to ofer “additional user descriptors tion) and user (country, age, gender,3 time on platform) that reflect their music taste and consumption behavior”, meta-data. it is telling that both the modelling and the evaluation

We applied principled data transformations to make by the original authors are still performed without any E v a l R S amenable to a larger audience whilst preserving real use of these rich meta-data [27]. By taking a fresh the rich information in the original dataset. We detail look at an existing, popular dataset, E v a l R S challenges the data transformation process and our motivations: practitioners to think about models not just along familiar quantitative dimensions, but also along non-standard scores closer to human perception of relevance and fairness. • First, we removed u s e r s and a r t i s t s which have few interaction since they are likely to be too sparse to be informative. Following the suggestions in, we apply k-core [26] filtering to the bipar- 3.2. Evaluation Metrics tite interaction graph between u s e r s and a r t i s t s , setting = 10 (i.e. we retain vertices with a mini- Submission are evaluated according to our randomized mum degree of k). loop (Section 3.3) over the testing suite released with the • After the aforementioned processing, the dataset challenge. At a first glance, tests can be roughly divided still contained over 900M events, which moti- in three main groups: vated further filtering of the data. In particular, we keep only the first interaction a u s e r had with a given t r a c k , and for each u s e r we retain only their = 500 most recent unique t r a c k interactions. We supplement the information lost during this pruning step by providing the interaction count between a u s e r and a t r a c k .

• Standard RSs metrics: these are the typical point-wise metrics used in the field (e.g. MRR, HR@K) – they are included as sanity checks and as a informative baseline against which insights gained through the behavioral tests can be interpreted. • Standard metrics on a per-group or slice basis: as shown for example in [ 1 ], models which are indistinguishable on the full test set may exhibit very diferent behavior across data slices.

3Gender in the original dataset is a binary variable. This is a limita

tion, as it gives a stereotyped representation of gender. Our intent is not to make normative claims about gender.

It is therefore crucial to quantify model perfor- on each slice and the the MR obtained on the original mance for specific input and target groups, i.e. test set is averaged and negated (so that a higher value is there a performance diference between males implies better performance in the metric) to obtain the and females? Is there an accuracy drops when final score for each test. The slice-based tests considered artists are not very popular? for the final scores are: • Behavioral tests: this group may include perturbance tests (i.e. if we modify a user’s history by swapping Metallica with Pantera, how much will predictions change?), and error distance tests (i.e. if the ground truth is Shine On You Crazy Diamond and the prediction is Smoke on the Water, how severe is this error?).

Based on this taxonomy, we now survey the tests im

plemented in the R e c L i s t powering E v a l R S , with reference to relevant literature and examples from the target datasets. For implementation details please refer to the oficial repository. 4 3.2.1. Standard RSs metrics Based on popular metrics in the literature, we picked two standard metrics as a quantitative baseline and sanity check for our R e c L i s t : • Mean Reciprocal Rank (MRR) as a measure of where the first relevant element retrieved by the model is ranked in the output list. Besides being considered a standard rank-aware evaluation metric, we chose MRR because it is particularly simple to compute and to interpret. • Hit Rate (HR), defined as Recall at k ( = 100 ), i.e. the proportion of relevant items found in the top-k recommendation. 3.2.2. Standard metrics on a per-group or slice

basis Models are tested to address a wide spectrum of known issues for recommender systems, for instance: fairness (e.g. a model should have equal outcomes for diferent groups, e.g. [28, 29, 30]), robustness (e.g. a model should produce good outcomes also for long-tail items, such as items with less history or belonging to less represented categories, e.g. [31]), industry-specific use-cases (e.g. in the case of music, your model should not consistently penalize niche or simply less known artists).

All the tests in this group are based on Miss Rate (MR), defined as ratio between the prediction errors (i.e. model predictions do not contain the ground truth) and the number of predictions. Slices can be generalized asn partitions (e.g. Countries with UK/US/IT/FR and others is split is N partitions) of the test data forming n-ary classes. The absolute diference between the MR obtained 4https://github.com/RecList/evalRS-CIKM-2022. • Gender balance. This test is meant to address fairness towards gender [32]. Since the dataset only provides binary gender, the test will minimize the diference between the MR obtained on users who specified Female as gender and the MR obtained on the original test set. In other words, the smaller the diference, the fairer the model towards potential gender biases. • Artist popularity. This test is meant to address a known problem in music recommendations: niche (or simply less known) artists and users who are less interested in highly popular content are often penalized by recommender systems [33, 34]. This point appears even more important when we consider that several music streaming services (e.g. Spotify, Tidal) also act as marketplaces for artists to promote their music. Since splitting the test set in two would draw an arbitrary line between popular vs. unpopular artists, failing to capture the actual properties of the distribution. Instead, we split the test set into bins with equal size after logarithmic scaling. • User country. Music consumption is subject to many country dependent factors, such as language diferences, local sub-genres and styles, local licensing and distribution laws, cultural inlfuences of local traditional music, etc [ 35]. We capture this diversity by slicing the test set based on the top-10 countries by user counts. • Song popularity. This test measures the model performance on both popular tracks and on songs with fewer listening events. The test is designed to address both robustness to long tail items and cold-start scenarios, so we pooled together both less popular and newer songs. Again, we used logarithmic bucketing with base 10 to divide the test set in order to avoid arbitrary thresholds. • User history. The test can be viewed as a robustness/cold-start test, in which we sliced the dataset based on the length of user history on the platform. To create slices, we use the user play counts (i.e. the sum of play counts per user) and we use logarithmic bucketing in base 10 to divide the test set in order to avoid arbitrary thresholds. 3.2.3. Behavioral and qualitative tests Our final set of tests is behavioral in nature, and tries to capture (with some assumptions) how models difer based on qualitative aspects: • Be less wrong. It is important that RSs maintain get participants comfortable, through harmless a reasonable standard of relevance even when iterations, with the dataset and the multi-faceted the predictions are not accurate. For instance, nature of the challenge. if the ground truth for a recommendation is the 2. Second phase: after the organizers have evaluated rap song ‘Humble’ by Kendrick Lamar, a model the score distributions for individual tests, they might suggest another rap song from the same will attach diferent weights to each test to proyear (‘The story of O.J.’ by Jay-Z), or a famous pop duce a balanced macro-score - i.e. if a test turns song from the top chart of that year (‘Shape of You’ out to be easy for most participants, its imporby Ed Sheeran). There is still a substantial difer- tance will be counter-biased in the calculation. ence between these two as the first one is closer At the beginning of this phase, participants are to the ground truth than the second. Since this asked to update their evaluation script by cloning has a great impact on the overall user experience, again the data challenge repository: the purpose it is desirable that models test and measure their for each team becomes now leveraging the inperformance scenarios like the one just described. sights from the previous phase to optimize their We use the latent space of tracks to compute the models as much as possible for the leaderboard. average pairwise cosine distance between the em- Only scores obtained in this phase are considered beddings of the predicted items and the ground for the final prizes .

truths. • Latent diversity: Diversity is closely tied with 3.3. Methodology the maximization of marginal relevance as a way to acknowledge uncertainty of user intent and Since the focus of the challenge is a popular public dataset, to address user utility in terms of discovery [36]. we implemented a robust evaluation procedure to avoid Diversity is often considered a partial proxy for data leakage and ensure fairness5. Our protocol is split in fairness and it is an important measure of the per- two phases: local – when teams iterate on their solution formance of recommender systems in real world during the challenge - and remote – when organizers verscenarios [37]. We address diversity using the ify the submissions at the end and proclaim the winners: latent space of tracks testing for model density - where density is defined as the summation of the diferences between each point in the prediction space and the mean of the prediction space.

Additionally, in order to account also for the “correctness” of prediction vectors, we calculate a bias defined as the distance between the ground truth vector and the mean of the prediction vector and weight to penalize for high bias: the final score is computed as 0.3 * diversity - 0.7 * bias, where 0.3 and 0.7 are weights that we determined empirically to balance diversity and correctness.

Please note that since we aim at widening the commu

nity contribution to testing, the final code submission for E v a l R S includes as a requirement that participants contribute at least one custom test, by extending the provided abstraction. 3.2.4. Final score Since each of the tests above return a score from a potentially unique, non-normal distribution, we need a way to define a macro-score for the leaderboard. To define the formula we adopt an empirical approach in two phases: 1. First phase: scores of individual tests are simply averaged to get the leaderboard macro-score. The • Local evaluation protocol: For each fold, the provided script first samples 25% of the users in the dataset. It then partitions the dataset into training and testing sets using the leave-one-out protocol: the testing set comprises a list of unique users, where the target song for each of them has been picked randomly from their history. The training set is the listening history for these sampled users with their test song removed. Participants’ models will be trained and tuned based on their custom logic on the training set, and then evaluated over the test suite (Section 3.2) to provide a ifnal score for each run (Section 3.2.4); partitioning, training, testing, scoring will be done for a total of 4 repetitions: the average of the runs will constitute the leaderboard score. • Remote evaluation protocol: the organizers will run the code submitted by participants, and repeat the random evaluation loop. The scores thus obtained on the E v a l R S test suite will be compared with participants submissions as a sanity check (statistical comparison of means and 95% bootstrapped CI).

Thanks to the provided APIs, participants will be able to run the full evaluation loop locally, as well as update purpose of this phase is to gather data on the rela- 5To help participants with the implementation, we provide a template tive dificulty and utility of the diferent tests, and script that can be modified with custom model code. their leaderboard score automatically through the pro- is Adj. Professor of MLSys at NYU, publishes regularly in vided script. To ensure a fair and reproducible remote top-tier conferences (including NAACL, ACL, RecSys, SIevaluation, final submission should contain a docker im- GIR), and is co-organizer of SIGIR eCom. Jacopo was the age that runs the local evaluation script and produces the lead organizer of the SIGIR Data Challenge 2021, speardesired output within the maximum allotted time on the heading the release of the largest session-based dataset target cloud machine. Please check E v a l R S repository for for eCommerce research. the exact final requirements and up-to-date instructions.

4.1. Structure and timeline E v a l R S unfolds in three main phases: 4. Organization, Community, Impact Federico Bianchi Federico Bianchi is a postdoctoral

researcher at Stanford University. He obtained his Ph.D. in Computer Science at the University of Milano-Bicocca in 2020. His research, ranging from Natural Language Processing methods for textual analytics to recommender systems for the e-commerce has been accepted to major NLP and AI conferences (EACL, NAACL, EMNLP, ACL,

AAAI, RecSys) and journals (Cognitive Science, Applied 1. CHALLENGE: An open challenge phase, where Intelligence, Semantic Web Journals). He co-organized participating teams register for the challenge and the SIGIR Data Challenge 2021. He frequently releases his work on improving the scores on both standard research as open-source tools that have collected almost and behavioral metrics across the two phases ex- a thousand GitHub stars and been downloaded over 100 plained above (3.2.4). thousand times. 2. CFP: A call for papers, where teams submit a written contribution, describing their system, custom Tobias Schnabel Tobias Schnabel is a senior retesting, data insights. searcher in the Productivity+Intelligence group at Mi3. CONFERENCE: At the conference, winners will crosoft Research. He is interested in improving humanbe announced and special prizes for novel testings facing machine learning systems in an integrated way, and oustanding student work will be awarded. considering not only algorithmic but also human facDuring the workshop, we plan to discuss solicited tors. To this end, his research draws from causal inpapers and host a round-table with experts on RSs ference, reinforcement learning, machine learning, HCI, evaluation. and decision-making under uncertainty. He was a coorganizer for a WSDM workshop this year and has served as (senior) PC member for a wide array of AI and data science conference (ICML, NeurIPS, WSDM, KDD). Before joining Microsoft, he obtained Ph.D. from the Computer Science Department at Cornell University under Thorsten Joachims.

Giuseppe Attanasio Giuseppe Attanasio is a postdoctoral researcher at Bocconi, where he works on largescale neural architectures for Natural Language Processing. His research focuses on understanding and regularizing models for debiasing and fairness purposes. His research on the topic has been accepted to major NLP conferences (ACL). While working at Bocconi, he is concluding his Ph.D. at the Department of Control and Computer Engineering at Politecnico di Torino.

Ciro Greco Ciro Greco was the co-founder and CEO of

Tooso, a San Francisco based startup specialized in Information Retrieval. Tooso was acquired in 2019 by Coveo, where he now works as VP or Artificial Intelligence.He holds a Ph.D. in Linguistics and Cognitive Neuroscience at Milano-Bicocca. He worked as visiting scholar at MIT and as a post-doctoral fellow at Ghent University. He published extensively in top-tier conferences (including

Our CFP takes a “design paper” perspective, where

teams are invited to discuss both how they adapted their initial model to take into account the test suite, and how the tests strengthened their understanding of the target dataset and use case6.

We emphasize the CFP and CONFERENCE steps as moments to share with the community additional tests, error analysis and data insights inspired by E v a l R S . By leveraging RecList, we not only enable teams to quickly iterate starting from our ideas, but we promise to immediately circulate in the community their testing contribution through a popular open source package. Finally, we plan on using CEUR-WS to publish the accepted papers, as well as drafting a final public report as an additional, actionable artifacts from the challenge.

4.2. Organizers Jacopo Tagliabue Jacopo Tagliabue was co-founder

of Tooso, an Information Retrieval company acquired by Coveo in 2019. As Director of AI at Coveo, he divides his time between product, research, and evangelization: he

6As customary in these events, we will involve a small committee

from top-tier practitioners and scholars to ensure the quality of the ifnal submissions.

NAACL, ACL, RecSys, SIGIR) and scientific journals (The the evaluation of RSs and fairness; second, researchers Linguistic Review, Cognitive Science, Nature Commu- who proposed a new model and desire to test its genernications). He was also co-organizer of the SIGIR Data alization abilities on new metrics; third, industrial pracChallenge 2021. titioners that started using R e c L i s t after its release in recent months, and already signaled strong support for Gabriel de Souza P. Moreira Gabriel Moreira is a Sr. behavioral testing in their real-world use cases. Applied Research Scientist at NVIDIA, leading the re- E v a l R S makes a novel and significant contribution to search eforts of Merlin research team. He had his PhD the community: first, we ask practitioners to “live and degree from ITA university, Brazil, with a focus on Deep breath” the problem of evaluation, operationalizing prinLearning for RecSys and Session-based recommendation. ciples and insights through sharable code; second, we Before joining NVIDIA, he was lead Data Scientist at embrace a “build in the open” approach, as all artifacts CI&T for 5 years, after working as software engineer for from the event will be available to the community as more than a decade. In 2019, he was recognized as a a permanent contribution, in the form of open source Google Developer Expert (GDE) for Machine Learning. code, design papers, and public documentation – through He was part of the NVIDIA teams that won recent Rec- prizes assigned based on scores, but also outstanding Sys competitions: ACM RecSys Challenge 2020, WSDM testing and paper contributions, and special awards for WebTour Workshop Challenge 2021 by Booking.com and students, we hope to actively encourage more practitionthe SIGIR eCommerce Workshop Data Challenge 2021 ers to join the evaluation debate and get a more diverse by Coveo. set of perspectives for our workshop. As argued throughout this paper, when comparing Patrick John Chia Patrick John Chia is an Applied E v a l R S methodology to typical data challenges, we can Scientist at Coveo. Prior to this, he completed his Mas- summarize three important diferentiating factors: first , ter’s degree at Imperial College London and spent a year we fight public leaderboard overfitting through our ranat Massachusetts Institute of Technology (MIT). He was domized evaluation loop; second, we discourage complex co-organizer of the 2021 SIGIR Data Challenge and has solutions that cannot be practically used, as our open been a speaker on topics at the intersection of Machine source code competition provides a fixed (and reasonLearning and eCommerce (SIGIR eCom, ECNLP at ACL). able) compute budget; third and most importantly, with His latest interests lie in developing AI that has the ability a thorough evaluation with per-group and behavioral to learn like infants and applying it to creating solutions tests, we encourage participants to seek non-standard at Coveo. performance and discuss fairness implications.

We strongly believe these points will lay down the foundation for a first-of-its-kind automatic, shared, identifiable evaluation standard for RSs.

5. Similar Events and Broader Outlook

The CIKM-related community has shown great interest in themes at the intersection of aligning machine learning with human judgment, rigorous evaluation settings, and fairness, as witnessed by popular Data Challenges and important workshops in top-tier venues. Among recent challenges, the 2021 SIGIR-Ecom Data Challenge, the 2021 Booking Data Challenge, and the 2020 RecSys Challenge are all events centered around the evaluation of RSs, yet still substantially diferent: for example, the SIGIR Challenge focused on MRR as a success metric [10], while the Booking Challenge [38] used top-k accuracy.

Moreover, the growing interest for rounded evaluation led to the creation of many interesting workshops in recent years, such as IntRS: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, ImpactRS: Workshop on the Impact of Recommender Systems and FAccTRec: Workshop on Responsible Recommendation. For this reason, we expect this challenge to attract a diverse set of practitioners: first, researchers interested in

6. ACKNOWLEDGEMENTS R e c L i s t is an open source library whose development is

supported by forward looking companies in the machine learning community: the organizers wish to thank Comet, Neptune, Gantry for their generous support.7

7Please check the project website for more details: https://reclist.io/.

Computing Machinery, New York, NY, USA, 2016, [11] F. M. Harper, J. A. Konstan, The movielens datasets: p. 103–110. URL: https://doi.org/10.1145/2911996. History and context, ACM Trans. Interact. Intell. 2912004. doi:1 0 . 1 1 4 5 / 2 9 1 1 9 9 6 . 2 9 1 2 0 0 4 . Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. [3] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we doi:1 0 . 1 1 4 5 / 2 8 2 7 8 7 2 .

really making much progress? a worrying analy- [12] H. Zamani, M. Schedl, P. Lamere, C.-W. Chen, An sis of recent neural recommendation approaches, analysis of approaches taken in the acm recsys chalin: Proceedings of the 13th ACM Conference on lenge 2018 for automatic music playlist continuaRecommender Systems, RecSys ’19, Association for tion, ACM Trans. Intell. Syst. Technol. 10 (2019). Computing Machinery, New York, NY, USA, 2019, URL: https://doi.org/10.1145/3344257. doi:1 0 . 1 1 4 5 / p. 101–109. URL: https://doi.org/10.1145/3298689. 3 3 4 4 2 5 7 .

3347058. doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 . 3 3 4 7 0 5 8 . [13] D. Kotkov, J. Veijalainen, S. Wang, Challenges of [4] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, serendipity in recommender systems, in: WEBIST, C. Geng, Are we evaluating rigorously? bench- 2016. marking recommendation for reproducible evalua- [14] D. Jannach, M. Ludewig, When recurrent neural tion and fair comparison, in: Fourteenth ACM con- networks meet the neighborhood for session-based ference on recommender systems, 2020, pp. 23–32. recommendation, in: Proceedings of the Eleventh [5] X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neu- ACM Conference on Recommender Systems, 2017, ral graph collaborative filtering, in: Proceedings of pp. 306–310. the 42nd international ACM SIGIR conference on [15] M. Ludewig, D. Jannach, Evaluation of sessionResearch and development in Information Retrieval, based recommendation algorithms, User Modeling 2019, pp. 165–174. and User-Adapted Interaction 28 (2018) 331–390. [6] A. Rashed, S. Jawed, L. Schmidt-Thieme, [16] M. T. Ribeiro, T. S. Wu, C. Guestrin, S. Singh, BeA. Hintsches, Multirec: A multi-relational yond accuracy: Behavioral testing of nlp models approach for unique item recommendation in with checklist, in: ACL, 2020. auction systems, Fourteenth ACM Conference on [17] G. d. S. P. Moreira, S. Rabhi, J. M. Lee, R. Ak, Recommender Systems (2020). E. Oldridge, Transformers4rec: Bridging the gap [7] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Lib- between nlp and sequential/session-based recomerty, K. Al Jadda, From the lab to production: A mendation, in: Fifteenth ACM Conference on Reccase study of session-based recommendations in the ommender Systems, 2021, pp. 143–153. home-improvement domain, in: Fourteenth ACM [18] K. Ariu, N. Ryu, S. Yun, A. Proutière, ReConference on Recommender Systems, RecSys gret in online recommendation systems, ArXiv ’20, Association for Computing Machinery, New abs/2010.12363 (2020).

York, NY, USA, 2020, p. 140–149. URL: https://doi. [19] J. Tagliabue, B. Yu, F. Bianchi, The Embeddings org/10.1145/3383313.3412235. doi:1 0 . 1 1 4 5 / 3 3 8 3 3 1 3 . That Came in From the Cold: Improving Vectors 3 4 1 2 2 3 5 . for New and Rare Products with Content-Based [8] T. Moins, D. Aloise, S. J. Blanchard, Recseats: Inference, Association for Computing Machinery, A hybrid convolutional neural network choice New York, NY, USA, 2020, p. 577–578. URL: https: model for seat recommendations at reserved seat- //doi.org/10.1145/3383313.3411477. ing venues, in: Fourteenth ACM Conference on [20] L. Briand, G. Salha-Galvan, W. Bendada, M. MorRecommender Systems, RecSys ’20, Association for lon, V.-A. Tran, A semi-personalized system for Computing Machinery, New York, NY, USA, 2020, user cold start recommendation on music streamp. 309–317. URL: https://doi.org/10.1145/3383313. ing apps, 2020. URL: arXiv:2106.03819. 3412263. doi:1 0 . 1 1 4 5 / 3 3 8 3 3 1 3 . 3 4 1 2 2 6 3 . [21] M. Hendriksen, E. Kuiper, P. Nauts, S. Schelter, [9] F. Bianchi, J. Tagliabue, B. Yu, Query2Prod2Vec: M. de Rijke, Analyzing and predicting purchase Grounded word embeddings for eCommerce, in: intent in e-commerce: Anonymous vs. identified Proceedings of the 2021 Conference of the North customers, 2020. URL: https://arxiv.org/abs/2012. American Chapter of the Association for Computa- 08777. tional Linguistics: Human Language Technologies: [22] Krista Garcia, The impact of product recommenIndustry Papers, Association for Computational dations, 2018. URL: https://www.emarketer.com/ Linguistics, Online, 2021, pp. 154–162. URL: https: content/the-impact-of-product-recommendations. //aclanthology.org/2021.naacl-industry.20. doi:1 0 . [23] A. Flexer, D. Schnitzer, J. Schlueter, A mirex meta1 8 6 5 3 / v 1 / 2 0 2 1 . n a a c l - i n d u s t r y . 2 0 . analysis of hubness in audio music similarity, 2012. [10] J. Tagliabue, C. Greco, J.-F. Roy, F. Bianchi, G. Cas- [24] M. Twohey, G. J. Dance, Lawmakers press amazon sani, B. Yu, P. J. Chia, Sigir 2021 e-commerce work- on sales of chemical used in suicides, 2022. URL: shop data challenge, in: SIGIR eCom 2021, 2021. https://www.nytimes.com/2022/02/04/technology/ amazon-suicide-poison-preservative.html. [25] J. Tagliabue, You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack, Association for Computing Machinery, New York, NY, USA, 2021, p. 598–600. URL: https://doi.org/10.1145/3460231.

3474604. [26] V. Batagelj, M. Zaveršnik, Generalized cores, Advances in Data Analysis and Classification 5 (2011) 129–145. [27] M. Schedl, Investigating country-specific music preferences and music recommendation algorithms with the lfm-1b dataset, International Journal of

Multimedia Information Retrieval 6 (2017) 71 – 84. [28] J. S. Ke Yang, Measuring fairness in ranked outputs, in: SSDBM 2017: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, 2017, pp. 1 – 6. URL: https://doi.org/10.1145/3085504.3085526. [29] C. Castillo, Fairness and transparency in ranking, in: ACM SIGIR ForumVolume, volume Volume 52, 2019, pp. 64 – 71. URL: https://doi.org/10.1145/ 3308774.3308783. [30] M. Zehlike, K. Yang, J. Stoyanovich, Fairness in ranking: A survey, in: TBD. ACM, 2020, pp. 1–58.

URL: https://arxiv.org/pdf/2103.14000.pdf. [31] M. O’Mahony, N. Hurley, N. Kushmerick, G. Silvestre, Collaborative recommendation: A robustness analysis, volume 4, 2004. URL: https://doi.org/ 10.1145/1031114.1031116. [32] S. Saxena, S. Jain, Exploring and mitigating gender bias in recommender systems with explicit feedback (2021). URL: arXivpreprintarXiv:2112.02530. [33] D. Kowald, M. Schedl, E. Lex, The unfairness of popularity bias in music recommendation: A reproducibility study, European conference on information retrieval (2020). [34] Òscar Celma, P. Cano, From hits to niches? or how popular artists can bias music recommendation and discovery, in: Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, 2008. URL: https://citeseerx.ist.psu.edu/viewdoc/download? doi=10.1.1.168.5009&rep=rep1&type=pdf. [35] P. Bello, D. Garcia, Cultural divergence in popular music: the increasing diversity of music consumption on spotify across countries, Humanities and

Social Sciences Communications 8 (2021). [36] M. Drosou, H. Jagadish, E. Pitoura, J. Stoyanovich,

Diversity in big data: A review, Big data 5.2 (2017) 73–84. [37] Diversity in recommender systems – a survey,

Knowledge-Based Systems (2017) 154–162. [38] M. Baigorria Alonso, Data augmentation using many-to-many rnns for session-aware recommender systems, in: ACM WSDM Workshop on Web Tourism (WSDM WebTour’21), 2021.

[1]

P. J.

Chia ,

Tagliabue ,

Bianchi ,

He ,

Ko , Beyond

NDCG

: behavioral testing of recommender systems with reclist , CoRR abs/2111 .09963 ( 2021 ). URL: https://arxiv.org/abs/ 2111.09963. a r X i v : 2 1 1 1 . 0 9 9 6 3 .

[2]

Schedl , The lfm-1b dataset for music retrieval and recommendation , in: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval , ICMR '16, Association for