EvalRS: a Rounded Evaluation of Recommender Systems

Jacopo Tagliabue1,2,∗,†, Federico Bianchi3,†, Tobias Schnabel4,†, Giuseppe Attanasio5,†, Ciro Greco1,2,†, Gabriel de Souza P. Moreira6,† and Patrick John Chia7,†

1 South Park Commons, New York, NY, USA
2 Coveo Labs, New York, NY, USA
3 Stanford University, Stanford, CA, USA
4 Microsoft, Redmond, WA, USA
5 Bocconi University, Milan, Italy
6 NVIDIA, São Paulo, Brazil
7 Coveo, Montreal, Canada

Abstract
Much of the complexity of recommender systems (RSs) comes from the fact that they are used as part of highly diverse real-world applications, which require them to deal with a wide array of user needs. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings, while giving little attention to the evaluation of RS behavior in real-world scenarios. Such a narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as the reinforcement of data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs "in the wild".

Keywords
recommender systems, behavioral testing, open source

1. Introduction

Recommender systems (RSs) are embedded in most applications we use today. From streaming services to online retailers, the accuracy of a RS is a key factor in the success of many products. Evaluation of RSs has often been done considering point-wise metrics, such as HitRate (HR) or nDCG over held-out data points, but the field has recently begun to recognize the importance of a more rounded evaluation as a better proxy to real-world performance [1]. We designed EvalRS as a new type of data challenge in which participants are asked to test their models incorporating quantitative as well as behavioral insights. Using a popular open dataset – Last.fm – we go beyond single aggregate numbers and instead require participants to optimize for a wide range of recommender system properties. The contribution of this challenge is two-fold:

1. We propose and standardize the data, evaluation loop and testing for RSs over a popular use case (user-item recommendations for music consumption [2]), thus releasing in the open domain a first unified benchmark for this topic.
2. We bring together the community on evaluation from both an industrial and research point of view, to foster an inclusive debate for a more nuanced evaluation of RSs.

In this paper, we describe the conceptual and practical motivations behind EvalRS, provide context on the organizers, related events and relevant literature, and explain the evaluation methodology we champion. For participation rules, up-to-date implementation details and all the artifacts produced before and during the challenge, please refer to the EvalRS official repository.1

2. Motivation

EvalRS at CIKM 2022 complements the existing challenge landscape and is driven by two different perspectives: the first one coming from academic research, the second one from the industrial development of RSs. We examine these in turn.

EvalRS 2022: CIKM EvalRS 2022 Data Challenge, October 21, 2022, Atlanta, GA
∗ Corresponding author.
† TS proposed the format and methodology and worked with JT and FB towards a first draft. PC led the implementation and contributed most of the RecList code. GA, CG, FB and PC researched, iterated and operationalized behavioral tests. GM reviewed the API and we implemented baselines, while GA, JT and FB prepared tutorials for participants. Everybody helped with drafting the paper, rules and guidelines. JT and FB acted as senior PIs in the project. JT and CG started this work at Coveo Labs, New York, NY, USA.
Email: tagliabue.jacopo@gmail.com (J. Tagliabue)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
1 https://github.com/RecList/evalRS-CIKM-2022
2.1. A Research Perspective

Although undeniable progress was made in the past years, concerns have been raised about the status of research advancements in the field of recommendations, particularly with respect to ephemeral processes in motivating architectural choices and lack of reproducibility [3]. This challenge draws attention to a further – and potentially deeper – issue: even if the "reproducibility crisis" is solved, we are still mostly dealing with point-wise quantitative metrics as the only benchmarks for RSs. As reported by Sun et al. [4], the dominating metrics used in the evaluation of recommender systems published at top-tier conferences (RecSys, SIGIR, CIKM) are standard information retrieval metrics, such as MRR, Recall, HITS, NDCG [5, 6, 7, 8, 9].

While it is undoubtedly convenient to summarize the performance of different models via one score, this lossy projection discards a lot of important information on model behavior: for example, given the power-law distribution in many real-world datasets ([10, 11, 12]), marginal improvements on frequent items may translate into noticeable accuracy gains, even at the cost of significantly degrading the experience of subgroups. Metrics such as coverage, serendipity, and bias [13, 14, 15] are a first step in the right direction, but they still fall short of capturing the full complexity of deploying RSs.

Following the pioneering work of [16] in Natural Language Processing, we propose to supplement standard retrieval metrics with new tests: in particular, we encourage practitioners to go beyond the false dichotomy "quantitative-and-automated" vs "qualitative-and-manual", and find a middle ground in which behavioral desiderata can be expressed transparently in code [1].
2.2. An Industrial Perspective

RSs in practice differ from RSs used in research in crucial ways. For example, in research, a static dataset is used repeatedly, and there is no real interactivity between the model and users: prediction over a given point in time x_t in the test set doesn't change what happens at x_{t+1}.2 Even without considering the complexity of reproducing real-world interactions for benchmarking purposes, we highlight four important themes from our experience in building RSs at scale in production scenarios:

• Cold-start performance: new/rare items and users are challenging for many models across industries [19, 20]. In e-commerce, for instance, while most "similar products" predictions will happen over frequent items, in reality new users and items can represent a big portion of them, with significant business consequences: the cold-start problem is believed to affect 50% of users [21], in a context where field studies found that 40% of shoppers would stop shopping if shown non-relevant recommendations [22].
• Use cases and industry idiosyncrasies: different use cases in different industries present different challenges. For instance, recommendations for complementary items in e-commerce need to account for the fact that if item A is a good complementary candidate for item B, the reverse might not hold (e.g. an HDMI cable is a good complementary item for a 4K TV, but not vice versa). Music recommendations need to deal with the issue of "hubness", where popular items act as hubs in the top-N recommendation list of many users without being similar to the users' profiles, making other items invisible to the recommender [23]. Such use-case specific traits are particularly important when designing effective testing procedures and often require considerable domain knowledge.
• Not all mistakes are equal: point-wise metrics are unable to distinguish different types of mistakes; this is especially problematic for recommender systems, as even a single mistake may cause great social and reputational damage [24].
• Robustness matters as much as accuracy: while historically a significant part of industry effort can be traced back to a few key players, there is a blooming market of Recommendation-as-a-Service systems designed to address the needs of "reasonable scale" systems [25]. Instead of vertical scaling and extreme optimization, SaaS providers emphasize horizontal scaling through multiple deployments, highlighting the importance of models that prove to be flexible and robust across many dimensions (e.g., traffic, industry, etc.).

While not related to model evaluation per se, decision-making processes in the real world would also take into account the different resources used by competing approaches: time (both as time for training and latency for serving), computing (CPU vs GPU), and CO2 emissions are all typically included in an industry benchmark.

2 This is especially important in the context of sequential recommenders [17], which arguably resemble more reinforcement learning than supervised inference with pseudo-feedback [18].

3. EvalRS Challenge

We propose to supplement standard retrieval metrics over held-out data points with behavioral tests: in behavioral tests, we treat the target model as a black-box and supply only input-output pairs (for example, query user and desired recommended song). In particular, we leverage a recent open-source package, RecList [1], to prepare a suite of tests for our target dataset (Section 3.1). In putting forward our tests, we operationalize the intuitions from Section 2 through a general plug-and-play API to facilitate model comparison and data preparation, and by providing convenient abstractions and ready-made recommenders used as baselines.
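To make the black-box setup concrete, the following is a minimal sketch of the idea: the test harness only sees a model's input-output pairs, never its internals. This is an illustration, not the actual RecList API; the class and function names (BlackBoxRecommender, run_test, the popularity baseline) are hypothetical.

```python
# Minimal sketch of black-box behavioral testing: the suite consumes only
# (input, output) pairs and never inspects model internals.
from abc import ABC, abstractmethod
from typing import Dict, List


class BlackBoxRecommender(ABC):
    """Anything that maps a user id to a ranked list of track ids."""

    @abstractmethod
    def predict(self, user_ids: List[str], k: int = 100) -> Dict[str, List[str]]:
        """Return the top-k recommended track ids for each user id."""


def run_test(model: BlackBoxRecommender,
             ground_truth: Dict[str, str],
             metric_fn,
             k: int = 100) -> float:
    """Feed inputs, collect outputs, score them with any metric function."""
    predictions = model.predict(list(ground_truth.keys()), k=k)
    return metric_fn(predictions, ground_truth)


class PopularityBaseline(BlackBoxRecommender):
    """Trivial baseline: recommend the same globally popular tracks to everyone."""

    def __init__(self, ranked_tracks: List[str]):
        self.ranked_tracks = ranked_tracks

    def predict(self, user_ids, k=100):
        return {u: self.ranked_tracks[:k] for u in user_ids}


def hit_rate(predictions, ground_truth):
    hits = sum(1 for u, t in ground_truth.items() if t in predictions.get(u, []))
    return hits / len(ground_truth)


if __name__ == "__main__":
    model = PopularityBaseline(ranked_tracks=["t_1", "t_2", "t_3"])
    print(run_test(model, {"user_a": "t_2", "user_b": "t_9"}, hit_rate))  # 0.5
```

Because every test is expressed against this narrow interface, the same suite can be applied to any participant model, from matrix factorization to large neural recommenders.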
3.1. Use Case and Dataset

EvalRS is a user-item recommendation challenge in the music domain: participants are asked to train a model that, given a user id, recommends an appropriate song out of a known set of songs. The ground truth necessary to compute all the test metrics, quantitative and behavioral, is provided by our leave-one-out framework: for each user, we remove a song from their listening history and use it as the ground truth when evaluating the models.

We provide test abstractions and an evaluation script designed for LFM, a transformed version of the LFM-1b dataset [2] – a dataset focused on music consumption on Last.fm. We chose the LFM-1b dataset as the primary data source, after a thorough comparison of popular datasets, for a unique combination of features. Given our focus on rounded evaluation and the importance of joining prediction / ground truth with meta-data, LFM is an ideal dataset, as it provides rich song (artist, album information) and user (country, age, gender,3 time on platform) meta-data.

We applied principled data transformations to make EvalRS amenable to a larger audience whilst preserving the rich information in the original dataset. We detail the data transformation process and our motivations:

• First, we removed users and artists which have few interactions, since they are likely to be too sparse to be informative. Following suggestions in the literature, we apply k-core [26] filtering to the bipartite interaction graph between users and artists, setting k = 10 (i.e. we retain vertices with a minimum degree of k).
• After the aforementioned processing, the dataset still contained over 900M events, which motivated further filtering of the data. In particular, we keep only the first interaction a user had with a given track, and for each user we retain only their N = 500 most recent unique track interactions. We supplement the information lost during this pruning step by providing the interaction count between a user and a track.
• We then performed another iteration of k-core filtering, this time on the user-track interaction graph, with k = 10, to retain only users and tracks which are informative.
• Lastly, the original dataset contained missing meta-data (e.g. there were track_ids in the events data which did not have corresponding track metadata). We removed tracks, albums, artists and events which had missing information.
• We summarize the final dataset statistics in Table 1.

Table 1
Descriptive statistics for the LFM dataset.

  Items                                       Value
  Users                                       119,555
  Artists                                     62,943
  Albums                                      1,374,121
  Tracks                                      820,998
  Listening Events                            37,926,429
  User-Track History Length (25/50/75 pct)    241/346/413

3 Gender in the original dataset is a binary variable. This is a limitation, as it gives a stereotyped representation of gender. Our intent is not to make normative claims about gender.
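The sketch below illustrates the transformation pipeline just described, assuming a pandas DataFrame of listening events with hypothetical columns user_id, artist_id, track_id and timestamp; it mirrors the described steps rather than reproducing the exact challenge script.

```python
# Sketch of the preprocessing steps: k-core filtering on the user-artist
# graph, per-pair deduplication with interaction counts, per-user truncation
# to the 500 most recent tracks, and a second k-core pass on user-track.
import pandas as pd


def k_core_filter(events: pd.DataFrame, left: str, right: str, k: int = 10) -> pd.DataFrame:
    """Iteratively drop rows until every node on both sides of the bipartite
    graph (e.g. users and artists) has degree >= k."""
    while True:
        left_deg = events.groupby(left)[right].nunique()
        right_deg = events.groupby(right)[left].nunique()
        keep = events[left].map(left_deg).ge(k) & events[right].map(right_deg).ge(k)
        if keep.all():
            return events
        events = events[keep]


def preprocess(events: pd.DataFrame, n_recent: int = 500) -> pd.DataFrame:
    # 1. k-core on the user-artist bipartite graph (k=10).
    events = k_core_filter(events, "user_id", "artist_id", k=10)
    # 2. Keep only the first interaction of each (user, track) pair,
    #    while recording how many times the pair occurred.
    counts = (
        events.groupby(["user_id", "track_id"])
        .size()
        .rename("interaction_count")
        .reset_index()
    )
    events = (
        events.sort_values("timestamp")
        .drop_duplicates(["user_id", "track_id"], keep="first")
        .merge(counts, on=["user_id", "track_id"], how="left")
    )
    # 3. Retain the N=500 most recent unique tracks per user.
    events = (
        events.sort_values("timestamp")
        .groupby("user_id", group_keys=False)
        .tail(n_recent)
    )
    # 4. Second k-core pass, this time on the user-track graph (k=10).
    return k_core_filter(events, "user_id", "track_id", k=10)
```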
Taken together, these features allow us to fulfill EvalRS' promise of offering a challenging setting and a rounded evaluation. While a clear motivation behind the release of the LFM-1b dataset was to offer "additional user descriptors that reflect their music taste and consumption behavior", it is telling that both the modelling and the evaluation by the original authors are still performed without any real use of these rich meta-data [27]. By taking a fresh look at an existing, popular dataset, EvalRS challenges practitioners to think about models not just along familiar quantitative dimensions, but also along non-standard scores closer to human perception of relevance and fairness.

3.2. Evaluation Metrics

Submissions are evaluated according to our randomized loop (Section 3.3) over the testing suite released with the challenge. At a first glance, tests can be roughly divided into three main groups:

• Standard RSs metrics: these are the typical point-wise metrics used in the field (e.g. MRR, HR@K) – they are included as sanity checks and as an informative baseline against which insights gained through the behavioral tests can be interpreted.
• Standard metrics on a per-group or slice basis: as shown for example in [1], models which are indistinguishable on the full test set may exhibit very different behavior across data slices. It is therefore crucial to quantify model performance for specific input and target groups: is there a performance difference between males and females? Is there an accuracy drop when artists are not very popular?
• Behavioral tests: this group may include perturbation tests (i.e. if we modify a user's history by swapping Metallica with Pantera, how much will predictions change?), and error distance tests (i.e. if the ground truth is Shine On You Crazy Diamond and the prediction is Smoke on the Water, how severe is this error?).

Based on this taxonomy, we now survey the tests implemented in the RecList powering EvalRS, with reference to relevant literature and examples from the target dataset. For implementation details please refer to the official repository.4

3.2.1. Standard RSs metrics

Based on popular metrics in the literature, we picked two standard metrics as a quantitative baseline and sanity check for our RecList:

• Mean Reciprocal Rank (MRR), as a measure of where the first relevant element retrieved by the model is ranked in the output list. Besides being considered a standard rank-aware evaluation metric, we chose MRR because it is particularly simple to compute and to interpret.
• Hit Rate (HR), defined as Recall at k (k = 100), i.e. the proportion of relevant items found in the top-k recommendations.

4 https://github.com/RecList/evalRS-CIKM-2022.
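As a reference, here is a minimal sketch of the two point-wise metrics under our leave-one-out setup, where each user has a ranked list of predicted track ids and a single held-out ground-truth track; this is an illustration, not the official scoring code.

```python
# Point-wise sanity-check metrics: MRR and Hit Rate (Recall@k with a single
# relevant item per user, as in the leave-one-out protocol).
from typing import Dict, List


def mrr(predictions: Dict[str, List[str]], ground_truth: Dict[str, str]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the target item (0 if missed)."""
    total = 0.0
    for user, target in ground_truth.items():
        ranked = predictions.get(user, [])
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(ground_truth)


def hit_rate_at_k(predictions: Dict[str, List[str]],
                  ground_truth: Dict[str, str],
                  k: int = 100) -> float:
    """Fraction of users whose target track appears in the top-k recommendations."""
    hits = sum(
        1 for user, target in ground_truth.items()
        if target in predictions.get(user, [])[:k]
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    preds = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
    truth = {"u1": "b", "u2": "z"}
    print(mrr(preds, truth))            # (1/2 + 0) / 2 = 0.25
    print(hit_rate_at_k(preds, truth))  # 1 / 2 = 0.5
```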
3.2.2. Standard metrics on a per-group or slice basis

Models are tested to address a wide spectrum of known issues for recommender systems, for instance: fairness (e.g. a model should have equal outcomes for different groups, e.g. [28, 29, 30]), robustness (e.g. a model should produce good outcomes also for long-tail items, such as items with less history or belonging to less represented categories, e.g. [31]), and industry-specific use cases (e.g. in the case of music, a model should not consistently penalize niche or simply less known artists).

All the tests in this group are based on the Miss Rate (MR), defined as the ratio between the prediction errors (i.e. model predictions that do not contain the ground truth) and the number of predictions. Slices can be generalized as n partitions of the test data forming n-ary classes (e.g. the Country slice, with UK/US/IT/FR and others, is split into n partitions). The absolute difference between the MR obtained on each slice and the MR obtained on the original test set is averaged and negated (so that a higher value implies better performance on the metric) to obtain the final score for each test. The slice-based tests considered for the final scores are:

• Gender balance. This test is meant to address fairness towards gender [32]. Since the dataset only provides binary gender, the test will minimize the difference between the MR obtained on users who specified Female as gender and the MR obtained on the original test set. In other words, the smaller the difference, the fairer the model towards potential gender biases.
• Artist popularity. This test is meant to address a known problem in music recommendations: niche (or simply less known) artists, and users who are less interested in highly popular content, are often penalized by recommender systems [33, 34]. This point appears even more important when we consider that several music streaming services (e.g. Spotify, Tidal) also act as marketplaces for artists to promote their music. Splitting the test set in two would draw an arbitrary line between popular vs. unpopular artists, failing to capture the actual properties of the distribution; instead, we split the test set into bins of equal size after logarithmic scaling.
• User country. Music consumption is subject to many country-dependent factors, such as language differences, local sub-genres and styles, local licensing and distribution laws, cultural influences of local traditional music, etc. [35]. We capture this diversity by slicing the test set based on the top-10 countries by user counts.
• Song popularity. This test measures the model performance on both popular tracks and on songs with fewer listening events. The test is designed to address both robustness to long-tail items and cold-start scenarios, so we pooled together both less popular and newer songs. Again, we used logarithmic bucketing with base 10 to divide the test set, in order to avoid arbitrary thresholds.
• User history. This test can be viewed as a robustness/cold-start test, in which we sliced the dataset based on the length of user history on the platform. To create slices, we use the user play counts (i.e. the sum of play counts per user) and again apply logarithmic bucketing in base 10 to divide the test set, avoiding arbitrary thresholds.
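The following sketch illustrates the slice-based scoring just described – per-slice Miss Rate compared against the overall Miss Rate, with log-based bucketing for popularity slices. Column names and the bucketing helper are assumptions for illustration; the official implementation in the challenge repository may differ in details (e.g. how individual slice deltas are aggregated for a given test).

```python
# Slice-based scoring: compute the Miss Rate (MR) on the full test set and on
# each slice, average the absolute differences, and negate the result so that
# a higher value (closer to 0) is better.
import numpy as np
import pandas as pd


def miss_rate(df: pd.DataFrame) -> float:
    """Fraction of rows whose top-k predictions do not contain the target."""
    return 1.0 - df["hit"].mean()   # "hit" is 1 if the target was in the top-k


def sliced_score(df: pd.DataFrame, slice_col: str) -> float:
    overall_mr = miss_rate(df)
    deltas = [abs(miss_rate(group) - overall_mr) for _, group in df.groupby(slice_col)]
    return -float(np.mean(deltas))  # negated: 0 is the best possible value


def log_buckets(counts: pd.Series, base: int = 10) -> pd.Series:
    """Popularity slices via logarithmic bucketing (avoids arbitrary thresholds)."""
    return np.floor(np.log(counts.clip(lower=1)) / np.log(base)).astype(int)


if __name__ == "__main__":
    test_df = pd.DataFrame({
        "user_id": ["u1", "u2", "u3", "u4"],
        "hit": [1, 0, 1, 1],
        "gender": ["f", "f", "m", "m"],
        "track_play_count": [3, 12, 150, 7],
    })
    test_df["popularity_bin"] = log_buckets(test_df["track_play_count"])
    print(sliced_score(test_df, "gender"))          # -0.25 on this toy data
    print(sliced_score(test_df, "popularity_bin"))
```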
3.2.3. Behavioral and qualitative tests

Our final set of tests is behavioral in nature, and tries to capture (with some assumptions) how models differ based on qualitative aspects:

• Be less wrong. It is important that RSs maintain a reasonable standard of relevance even when the predictions are not accurate. For instance, if the ground truth for a recommendation is the rap song 'Humble' by Kendrick Lamar, a model might suggest another rap song from the same year ('The Story of O.J.' by Jay-Z), or a famous pop song from the top chart of that year ('Shape of You' by Ed Sheeran). There is still a substantial difference between these two, as the first one is closer to the ground truth than the second. Since this has a great impact on the overall user experience, it is desirable to test and measure model performance in scenarios like the one just described. We use the latent space of tracks to compute the average pairwise cosine distance between the embeddings of the predicted items and the ground truths.
• Latent diversity. Diversity is closely tied with the maximization of marginal relevance as a way to acknowledge uncertainty of user intent and to address user utility in terms of discovery [36]. Diversity is often considered a partial proxy for fairness, and it is an important measure of the performance of recommender systems in real-world scenarios [37]. We address diversity using the latent space of tracks, testing for model density – where density is defined as the summation of the differences between each point in the prediction space and the mean of the prediction space. Additionally, in order to account also for the "correctness" of prediction vectors, we calculate a bias, defined as the distance between the ground truth vector and the mean of the prediction vectors, and weight it to penalize for high bias: the final score is computed as 0.3 * diversity - 0.7 * bias, where 0.3 and 0.7 are weights that we determined empirically to balance diversity and correctness.

Please note that, since we aim at widening the community contribution to testing, the final code submission for EvalRS includes as a requirement that participants contribute at least one custom test, by extending the provided abstraction.
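As an illustration of these two behavioral scores, the sketch below operates on pre-computed track embeddings. It is a sketch only: the exact aggregation (sum vs. mean of differences), normalization, and distance choices of the official implementation may differ; 0.3 and 0.7 are the empirical weights reported above.

```python
# Behavioral scores in the track latent space: "be less wrong" (how close are
# the misses to the ground truth?) and "latent diversity" (spread of the
# predictions, penalized by their distance from the ground truth).
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def be_less_wrong(pred_embeddings: np.ndarray, truth_embedding: np.ndarray) -> float:
    """Average cosine distance between each predicted track and the ground
    truth: lower means the errors stay 'closer' to the right answer."""
    return float(np.mean([cosine_distance(p, truth_embedding) for p in pred_embeddings]))


def latent_diversity(pred_embeddings: np.ndarray, truth_embedding: np.ndarray,
                     w_div: float = 0.3, w_bias: float = 0.7) -> float:
    """Score = 0.3 * diversity - 0.7 * bias, where diversity is the spread of
    the predictions around their mean and bias is the distance between that
    mean and the ground truth."""
    center = pred_embeddings.mean(axis=0)
    diversity = float(np.mean(np.linalg.norm(pred_embeddings - center, axis=1)))
    bias = float(np.linalg.norm(truth_embedding - center))
    return w_div * diversity - w_bias * bias


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds = rng.normal(size=(10, 32))   # embeddings of 10 recommended tracks
    truth = rng.normal(size=32)         # embedding of the held-out track
    print(be_less_wrong(preds, truth))
    print(latent_diversity(preds, truth))
```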
3.2.4. Final score

Since each of the tests above returns a score from a potentially unique, non-normal distribution, we need a way to define a macro-score for the leaderboard. To define the formula we adopt an empirical approach in two phases:

1. First phase: scores of individual tests are simply averaged to get the leaderboard macro-score. The purpose of this phase is to gather data on the relative difficulty and utility of the different tests, and to get participants comfortable, through harmless iterations, with the dataset and the multi-faceted nature of the challenge.
2. Second phase: after the organizers have evaluated the score distributions for individual tests, they will attach different weights to each test to produce a balanced macro-score – i.e. if a test turns out to be easy for most participants, its importance will be counter-biased in the calculation. At the beginning of this phase, participants are asked to update their evaluation script by cloning again the data challenge repository: the purpose for each team now becomes leveraging the insights from the previous phase to optimize their models as much as possible for the leaderboard. Only scores obtained in this phase are considered for the final prizes.

3.3. Methodology

Since the focus of the challenge is a popular public dataset, we implemented a robust evaluation procedure to avoid data leakage and ensure fairness.5 Our protocol is split into two phases: local – when teams iterate on their solution during the challenge – and remote – when organizers verify the submissions at the end and proclaim the winners:

• Local evaluation protocol: For each fold, the provided script first samples 25% of the users in the dataset. It then partitions the dataset into training and testing sets using the leave-one-out protocol: the testing set comprises a list of unique users, where the target song for each of them has been picked randomly from their history. The training set is the listening history for these sampled users with their test song removed. Participants' models will be trained and tuned based on their custom logic on the training set, and then evaluated over the test suite (Section 3.2) to provide a final score for each run (Section 3.2.4); partitioning, training, testing and scoring will be done for a total of 4 repetitions: the average of the runs will constitute the leaderboard score.
• Remote evaluation protocol: the organizers will run the code submitted by participants, and repeat the random evaluation loop. The scores thus obtained on the EvalRS test suite will be compared with participants' submissions as a sanity check (statistical comparison of means and 95% bootstrapped CI).

Thanks to the provided APIs, participants will be able to run the full evaluation loop locally, as well as update their leaderboard score automatically through the provided script. To ensure a fair and reproducible remote evaluation, the final submission should contain a docker image that runs the local evaluation script and produces the desired output within the maximum allotted time on the target cloud machine. Please check the EvalRS repository for the exact final requirements and up-to-date instructions.

5 To help participants with the implementation, we provide a template script that can be modified with custom model code.
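The local protocol can be summarized with the following sketch – sample 25% of users, hold out one random track per user, train on the remainder, score with the test suite, and average over four repetitions. Function names (train_fn, evaluate_suite) and column names are placeholders, not the challenge API.

```python
# Sketch of the randomized leave-one-out evaluation loop described above.
import numpy as np
import pandas as pd


def leave_one_out_split(events: pd.DataFrame, user_frac: float = 0.25, seed: int = 0):
    rng = np.random.default_rng(seed)
    users = events["user_id"].unique()
    sampled = rng.choice(users, size=int(len(users) * user_frac), replace=False)
    sample = events[events["user_id"].isin(sampled)]
    # One random event per sampled user becomes the held-out test target.
    test = sample.groupby("user_id", group_keys=False).sample(n=1, random_state=seed)
    train = sample.drop(test.index)
    return train, test


def run_evaluation(events: pd.DataFrame, train_fn, evaluate_suite, n_folds: int = 4) -> float:
    """Average the macro-score over n_folds independent leave-one-out runs."""
    scores = []
    for fold in range(n_folds):
        train, test = leave_one_out_split(events, seed=fold)
        model = train_fn(train)                     # participant's custom training logic
        scores.append(evaluate_suite(model, test))  # full test suite -> macro-score
    return float(np.mean(scores))
```

Because partitioning is re-randomized at every repetition (and again during the remote verification run), overfitting a single fixed split offers little advantage on the leaderboard.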
4. Organization, Community, Impact

4.1. Structure and timeline

EvalRS unfolds in three main phases:

1. CHALLENGE: An open challenge phase, where participating teams register for the challenge and work on improving the scores on both standard and behavioral metrics across the two phases explained above (3.2.4).
2. CFP: A call for papers, where teams submit a written contribution describing their system, custom testing, and data insights.
3. CONFERENCE: At the conference, winners will be announced and special prizes for novel testing and outstanding student work will be awarded. During the workshop, we plan to discuss solicited papers and host a round-table with experts on RSs evaluation.

Our CFP takes a "design paper" perspective, where teams are invited to discuss both how they adapted their initial model to take into account the test suite, and how the tests strengthened their understanding of the target dataset and use case.6

We emphasize the CFP and CONFERENCE steps as moments to share with the community additional tests, error analysis and data insights inspired by EvalRS. By leveraging RecList, we not only enable teams to quickly iterate starting from our ideas, but we promise to immediately circulate in the community their testing contributions through a popular open source package. Finally, we plan on using CEUR-WS to publish the accepted papers, as well as drafting a final public report as an additional, actionable artifact from the challenge.

6 As customary in these events, we will involve a small committee of top-tier practitioners and scholars to ensure the quality of the final submissions.

4.2. Organizers

Jacopo Tagliabue. Jacopo Tagliabue was co-founder of Tooso, an Information Retrieval company acquired by Coveo in 2019. As Director of AI at Coveo, he divides his time between product, research, and evangelization: he is Adj. Professor of MLSys at NYU, publishes regularly in top-tier conferences (including NAACL, ACL, RecSys, SIGIR), and is co-organizer of SIGIR eCom. Jacopo was the lead organizer of the SIGIR Data Challenge 2021, spearheading the release of the largest session-based dataset for eCommerce research.

Federico Bianchi. Federico Bianchi is a postdoctoral researcher at Stanford University. He obtained his Ph.D. in Computer Science at the University of Milano-Bicocca in 2020. His research, ranging from Natural Language Processing methods for textual analytics to recommender systems for e-commerce, has been accepted to major NLP and AI conferences (EACL, NAACL, EMNLP, ACL, AAAI, RecSys) and journals (Cognitive Science, Applied Intelligence, Semantic Web Journal). He co-organized the SIGIR Data Challenge 2021. He frequently releases his research as open-source tools that have collected almost a thousand GitHub stars and been downloaded over 100 thousand times.

Tobias Schnabel. Tobias Schnabel is a senior researcher in the Productivity+Intelligence group at Microsoft Research. He is interested in improving human-facing machine learning systems in an integrated way, considering not only algorithmic but also human factors. To this end, his research draws from causal inference, reinforcement learning, machine learning, HCI, and decision-making under uncertainty. He was a co-organizer for a WSDM workshop this year and has served as (senior) PC member for a wide array of AI and data science conferences (ICML, NeurIPS, WSDM, KDD). Before joining Microsoft, he obtained his Ph.D. from the Computer Science Department at Cornell University under Thorsten Joachims.

Giuseppe Attanasio. Giuseppe Attanasio is a postdoctoral researcher at Bocconi, where he works on large-scale neural architectures for Natural Language Processing. His research focuses on understanding and regularizing models for debiasing and fairness purposes. His research on the topic has been accepted to major NLP conferences (ACL). While working at Bocconi, he is concluding his Ph.D. at the Department of Control and Computer Engineering at Politecnico di Torino.

Ciro Greco. Ciro Greco was the co-founder and CEO of Tooso, a San Francisco based startup specialized in Information Retrieval. Tooso was acquired in 2019 by Coveo, where he now works as VP of Artificial Intelligence. He holds a Ph.D. in Linguistics and Cognitive Neuroscience from Milano-Bicocca. He worked as a visiting scholar at MIT and as a post-doctoral fellow at Ghent University. He has published extensively in top-tier conferences (including NAACL, ACL, RecSys, SIGIR) and scientific journals (The Linguistic Review, Cognitive Science, Nature Communications). He was also a co-organizer of the SIGIR Data Challenge 2021.

Gabriel de Souza P. Moreira. Gabriel Moreira is a Sr. Applied Research Scientist at NVIDIA, leading the research efforts of the Merlin research team. He received his Ph.D. degree from ITA university, Brazil, with a focus on Deep Learning for RecSys and session-based recommendation. Before joining NVIDIA, he was lead Data Scientist at CI&T for 5 years, after working as a software engineer for more than a decade. In 2019, he was recognized as a Google Developer Expert (GDE) for Machine Learning. He was part of the NVIDIA teams that won recent RecSys competitions: the ACM RecSys Challenge 2020, the WSDM WebTour Workshop Challenge 2021 by Booking.com and the SIGIR eCommerce Workshop Data Challenge 2021 by Coveo.

Patrick John Chia. Patrick John Chia is an Applied Scientist at Coveo. Prior to this, he completed his Master's degree at Imperial College London and spent a year at the Massachusetts Institute of Technology (MIT). He was a co-organizer of the 2021 SIGIR Data Challenge and has been a speaker on topics at the intersection of Machine Learning and eCommerce (SIGIR eCom, ECNLP at ACL). His latest interests lie in developing AI that has the ability to learn like infants and applying it to creating solutions at Coveo.
5. Similar Events and Broader Outlook

The CIKM-related community has shown great interest in themes at the intersection of aligning machine learning with human judgment, rigorous evaluation settings, and fairness, as witnessed by popular Data Challenges and important workshops in top-tier venues. Among recent challenges, the 2021 SIGIR-Ecom Data Challenge, the 2021 Booking Data Challenge, and the 2020 RecSys Challenge are all events centered around the evaluation of RSs, yet still substantially different: for example, the SIGIR Challenge focused on MRR as a success metric [10], while the Booking Challenge [38] used top-k accuracy. Moreover, the growing interest in rounded evaluation led to the creation of many interesting workshops in recent years, such as IntRS: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, ImpactRS: Workshop on the Impact of Recommender Systems, and FAccTRec: Workshop on Responsible Recommendation.

For this reason, we expect this challenge to attract a diverse set of practitioners: first, researchers interested in the evaluation of RSs and fairness; second, researchers who proposed a new model and desire to test its generalization abilities on new metrics; third, industrial practitioners that started using RecList after its release in recent months, and already signaled strong support for behavioral testing in their real-world use cases.

EvalRS makes a novel and significant contribution to the community: first, we ask practitioners to "live and breathe" the problem of evaluation, operationalizing principles and insights through sharable code; second, we embrace a "build in the open" approach, as all artifacts from the event will be available to the community as a permanent contribution, in the form of open source code, design papers, and public documentation. Through prizes assigned based on scores, but also on outstanding testing and paper contributions, and special awards for students, we hope to actively encourage more practitioners to join the evaluation debate and get a more diverse set of perspectives for our workshop.

As argued throughout this paper, when comparing the EvalRS methodology to typical data challenges, we can summarize three important differentiating factors: first, we fight public leaderboard overfitting through our randomized evaluation loop; second, we discourage complex solutions that cannot be practically used, as our open source code competition provides a fixed (and reasonable) compute budget; third and most importantly, with a thorough evaluation with per-group and behavioral tests, we encourage participants to seek non-standard performance and discuss fairness implications.

We strongly believe these points will lay down the foundation for a first-of-its-kind automatic, shared, identifiable evaluation standard for RSs.

6. Acknowledgements

RecList is an open source library whose development is supported by forward-looking companies in the machine learning community: the organizers wish to thank Comet, Neptune, and Gantry for their generous support.7

7 Please check the project website for more details: https://reclist.io/.

References

[1] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, CoRR abs/2111.09963 (2021). URL: https://arxiv.org/abs/2111.09963. arXiv:2111.09963.
[2] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 103–110. URL: https://doi.org/10.1145/2911996.2912004. doi:10.1145/2911996.2912004.
[3] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 101–109. URL: https://doi.org/10.1145/3298689.3347058. doi:10.1145/3298689.3347058.
[4] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 23–32.
[5] X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neural graph collaborative filtering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 165–174.
[6] A. Rashed, S. Jawed, L. Schmidt-Thieme, A. Hintsches, MultiRec: A multi-relational approach for unique item recommendation in auction systems, in: Fourteenth ACM Conference on Recommender Systems, 2020.
[7] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Liberty, K. Al Jadda, From the lab to production: A case study of session-based recommendations in the home-improvement domain, in: Fourteenth ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 140–149. URL: https://doi.org/10.1145/3383313.3412235. doi:10.1145/3383313.3412235.
[8] T. Moins, D. Aloise, S. J. Blanchard, RecSeats: A hybrid convolutional neural network choice model for seat recommendations at reserved seating venues, in: Fourteenth ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 309–317. URL: https://doi.org/10.1145/3383313.3412263. doi:10.1145/3383313.3412263.
[9] F. Bianchi, J. Tagliabue, B. Yu, Query2Prod2Vec: Grounded word embeddings for eCommerce, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, Association for Computational Linguistics, Online, 2021, pp. 154–162. URL: https://aclanthology.org/2021.naacl-industry.20. doi:10.18653/v1/2021.naacl-industry.20.
[10] J. Tagliabue, C. Greco, J.-F. Roy, F. Bianchi, G. Cassani, B. Yu, P. J. Chia, SIGIR 2021 e-commerce workshop data challenge, in: SIGIR eCom 2021, 2021.
[11] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[12] H. Zamani, M. Schedl, P. Lamere, C.-W. Chen, An analysis of approaches taken in the ACM RecSys Challenge 2018 for automatic music playlist continuation, ACM Trans. Intell. Syst. Technol. 10 (2019). URL: https://doi.org/10.1145/3344257. doi:10.1145/3344257.
[13] D. Kotkov, J. Veijalainen, S. Wang, Challenges of serendipity in recommender systems, in: WEBIST, 2016.
[14] D. Jannach, M. Ludewig, When recurrent neural networks meet the neighborhood for session-based recommendation, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 306–310.
[15] M. Ludewig, D. Jannach, Evaluation of session-based recommendation algorithms, User Modeling and User-Adapted Interaction 28 (2018) 331–390.
[16] M. T. Ribeiro, T. S. Wu, C. Guestrin, S. Singh, Beyond accuracy: Behavioral testing of NLP models with CheckList, in: ACL, 2020.
[17] G. d. S. P. Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation, in: Fifteenth ACM Conference on Recommender Systems, 2021, pp. 143–153.
[18] K. Ariu, N. Ryu, S. Yun, A. Proutière, Regret in online recommendation systems, ArXiv abs/2010.12363 (2020).
[19] J. Tagliabue, B. Yu, F. Bianchi, The Embeddings That Came in From the Cold: Improving Vectors for New and Rare Products with Content-Based Inference, Association for Computing Machinery, New York, NY, USA, 2020, pp. 577–578. URL: https://doi.org/10.1145/3383313.3411477.
[20] L. Briand, G. Salha-Galvan, W. Bendada, M. Morlon, V.-A. Tran, A semi-personalized system for user cold start recommendation on music streaming apps, 2020. URL: arXiv:2106.03819.
[21] M. Hendriksen, E. Kuiper, P. Nauts, S. Schelter, M. de Rijke, Analyzing and predicting purchase intent in e-commerce: Anonymous vs. identified customers, 2020. URL: https://arxiv.org/abs/2012.08777.
[22] Krista Garcia, The impact of product recommendations, 2018. URL: https://www.emarketer.com/content/the-impact-of-product-recommendations.
[23] A. Flexer, D. Schnitzer, J. Schlueter, A MIREX meta-analysis of hubness in audio music similarity, 2012.
[24] M. Twohey, G. J. Dance, Lawmakers press Amazon on sales of chemical used in suicides, 2022. URL: https://www.nytimes.com/2022/02/04/technology/amazon-suicide-poison-preservative.html.
[25] J. Tagliabue, You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack, Association for Computing Machinery, New York, NY, USA, 2021, pp. 598–600. URL: https://doi.org/10.1145/3460231.3474604.
[26] V. Batagelj, M. Zaveršnik, Generalized cores, Advances in Data Analysis and Classification 5 (2011) 129–145.
[27] M. Schedl, Investigating country-specific music preferences and music recommendation algorithms with the LFM-1b dataset, International Journal of Multimedia Information Retrieval 6 (2017) 71–84.
[28] K. Yang, J. Stoyanovich, Measuring fairness in ranked outputs, in: SSDBM 2017: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, 2017, pp. 1–6. URL: https://doi.org/10.1145/3085504.3085526.
[29] C. Castillo, Fairness and transparency in ranking, in: ACM SIGIR Forum, volume 52, 2019, pp. 64–71. URL: https://doi.org/10.1145/3308774.3308783.
[30] M. Zehlike, K. Yang, J. Stoyanovich, Fairness in ranking: A survey, 2020, pp. 1–58. URL: https://arxiv.org/pdf/2103.14000.pdf.
[31] M. O'Mahony, N. Hurley, N. Kushmerick, G. Silvestre, Collaborative recommendation: A robustness analysis, volume 4, 2004. URL: https://doi.org/10.1145/1031114.1031116.
[32] S. Saxena, S. Jain, Exploring and mitigating gender bias in recommender systems with explicit feedback, 2021. URL: arXiv preprint arXiv:2112.02530.
[33] D. Kowald, M. Schedl, E. Lex, The unfairness of popularity bias in music recommendation: A reproducibility study, European Conference on Information Retrieval (2020).
[34] Ò. Celma, P. Cano, From hits to niches? Or how popular artists can bias music recommendation and discovery, in: Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, 2008. URL: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.168.5009&rep=rep1&type=pdf.
[35] P. Bello, D. Garcia, Cultural divergence in popular music: the increasing diversity of music consumption on Spotify across countries, Humanities and Social Sciences Communications 8 (2021).
[36] M. Drosou, H. Jagadish, E. Pitoura, J. Stoyanovich, Diversity in big data: A review, Big Data 5.2 (2017) 73–84.
[37] Diversity in recommender systems – a survey, Knowledge-Based Systems (2017) 154–162.
[38] M. Baigorria Alonso, Data augmentation using many-to-many RNNs for session-aware recommender systems, in: ACM WSDM Workshop on Web Tourism (WSDM WebTour'21), 2021.