<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EvalRS: a rounded evaluation of recom mender systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo Tagliabue</string-name>
          <email>tagliabue.jacopo@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Bianchi</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Schnabel</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Attanasio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ciro Greco</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel de Souza P. Moreira</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick John Chia</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bocconi University</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Coveo Labs</institution>
          ,
          <addr-line>New York, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Coveo</institution>
          ,
          <addr-line>Montreal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>Redmond, WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>NVIDIA</institution>
          ,
          <addr-line>São Paulo</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Stanford University</institution>
          ,
          <addr-line>Stanford, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Much of the complexity of recommender systems (RSs) comes from the fact that they are used as part of highly diverse real-world applications which requires them to deal with a wide array of user needs. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as the reinforcement of data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs “in the wild”.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>behavioral testing</kwd>
        <kwd>open source</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommender systems (RSs) are embedded in most
applications we use today. From streaming services to online
retailers, the accuracy of a RS is a key factor in the success
of many products. Evaluation of RSs has often been done
considering point-wise metrics, such as HitRate (HR) or
nDCG over held-out data points, but the field has recently
begun to recognize the importance of a more rounded
evaluation as a better proxy to real-world performance
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>We designed E v a l R S as a new type of data challenge in
which participants are asked to test their models
incorporating quantitative as well as behavioral insights. Using
a popular open dataset – Last.fm – we go beyond
single aggregate numbers and instead require participants
to optimize for a wide range of recommender systems
properties. The contribution of this challenge is two-fold:
EvalRS 2022: CIKM EvalRS 2022 DataChallenge, October 21, 2022,
FB towards a first draft. PC led the implementation and contributed
most of the RecList code. GA, CG, FB and PC researched, iterated
and operationalized behavioral tests. GM reviewed the API and
participants. Everybody helped with drafting the paper, rules and
guidelines. JT and FB acted as senior PIs in the project. JT and CG
2.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
      <sec id="sec-2-1">
        <title>E v a l R S at CIKM 2022 complements the existing challenge</title>
        <p>landscape and it is driven by two diferent perspectives:
the first one coming from academic research, the
second one from the industrial development of RSs. We</p>
      </sec>
      <sec id="sec-2-2">
        <title>1https://github.com/RecList/evalRS-CIKM-2022.</title>
        <sec id="sec-2-2-1">
          <title>2.1. A Research Perspective</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Although undeniable progress was made in the past years,</title>
        <p>concerns have been raised about the status of research
advancements in the field of recommendations,
particularly with respect to ephemeral processes in
motivating architectural choices and lack of reproducibility [3].
This challenge draws attention to a further – and
potentially deeper – issue: even if the “reproducibility crisis”
is solved, we are still mostly dealing with point-wise
quantitative metrics as the only benchmarks for RSs. As
reported by Sun et al. [4], the dominating metrics used
in the evaluation of recommender systems published at
top-tier conferences (RecSys, SIGIR, CIKM) are standard
information retrieval metrics, such as MRR, Recall, HITS,
NDCG [5, 6, 7, 8, 9].</p>
        <p>While it is undoubtedly convenient to summarize
the performance of diferent models via one score, this
lossy projection discards a lot of important information
on model behavior: for example, given the power-law
distribution in many real-world datasets ([10, 11, 12]),
marginal improvements on frequent items may translate
in noticeable accuracy gains, even at the cost of
significantly degrading the experience of subgroups. Metrics
such as coverage, serendipity, and bias [13, 14, 15] are a
ifrst step in the right direction, but they still fall short of
capturing the full complexity of deploying RSs.</p>
        <p>
          Following the pioneering work of [16] in Natural
Language Processing, we propose to supplement
standard retrieval metrics with new tests: in particular,
we encourage practitioners to go beyond the false
dichotomy “quantitative-and-automated” vs
“qualitativeand-manual”, and find a middle ground in which
behavioral desiderata can be expressed transparently in
code [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <sec id="sec-2-3-1">
          <title>2.2. An Industrial Perspective</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>RSs in practice difer from RSs used in research in crucial</title>
        <p>ways. For example, in research, a static dataset is used
repeatedly, and there is no real interactivity between the
model and users: prediction over a given point in time
  in the test set doesn’t change what happens at  +1 2.
Even without considering the complexity of reproducing
real-world interactions for benchmarking purposes, we
highlight four important themes from our experience in
building RSs at scale in production scenarios:
• Cold-start performance: new/rare items and users
are challenging for many models across
industries [19, 20]. In e-commerce, for instance, while
most “similar products” predictions will happen
2This is especially important in the context of sequential
recommender[17], which arguably resembles more reinforcement
learning than supervised inference with pseudo-feedback [18].
over frequent items, in reality, new users and
items can represent a big portion of them with
significant business consequences: the cold-start
problem is believed to afect 50% of users [ 21]
in a context where field studies found that 40%
of shoppers would stop shopping if shown
nonrelevant recommendations [22].
• Use cases and industry idiosyncrasies: diferent
use cases in diferent industries present
diferent challenges. For instance, recommendations
for complementary items in e-commerce need
to account for the fact that if item A is a good
complementary candidate for item B, the reverse
might not hold (e.g. an HDMI cable is a good
complementary item for a 4k TV, but not vice versa).
Music recommendations need to deal with the
issue of “hubness”, where popular items act as
hubs in the top-N recommendation list of many
users without being similar to the users’ profiles
and making other items invisible to the
recommender [23]. Such use-case specific traits are
particularly important when designing efective
testing procedures and often require considerable
domain knowledge.
• Not all mistakes are equal: point-wise metrics are
unable to distinguish diferent types of mistakes;
this is especially problematic for recommender
systems, as even a single mistake may cause great
social and reputational damage [24].
• Robustness matters as much as accuracy: while
historically a significant part of industry efort can be
traced back to a few key players, there is a
blooming market of Recommendation-as-a-Service
systems designed to address the needs of “reasonable
scale” systems [25]. Instead of vertical scaling and
extreme optimization, SaaS providers emphasize
horizontal scaling through multiple deployments,
highlighting the importance of models that prove
to be flexible and robust across many dimensions
(e.g., trafic, industry, etc.).</p>
        <p>While not related to model evaluation per se,
decisionmaking processes in the real world would also take into
account the diferent resources used by competing
approaches: time (both as time for training and latency for
serving), computing (CPU vs GPU), CO2 emissions are
all typically included in an industry benchmark.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. EvalRS Challenge</title>
      <p>
        We propose to supplement standard retrieval metrics
over held out data points with behavioral tests: in
behavioral tests, we treat the target model as a black-box
and supply only input-output pairs (for example, query
user and desired recommended song). In particular, we
leverage a recent open-source package, R e c L i s t [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], to
prepare a suite of tests for our target dataset (Section
3.1). In putting forward our tests, we operationalize the
intuitions from Section 2 through a general
plug-andplay API to facilitate model comparison and data
preparation, and by providing convenient abstractions and
ready-made recommenders used as baselines.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Use Case and Dataset</title>
        <p>E v a l R S is a user-item recommendation challenge in the
music domain: participants are asked to train a model • We then performed another iteration of k-core
that, given a user id, recommends an appropriate song out ifltering, this time on the u s e r - t r a c k interaction
of a known set of songs. The ground truth necessary to graph, with  = 10 to retain only users and tracks
compute all the test metrics, quantitative and behavioral, which are informative.
is provided by our leave-one-out framework: for each • Lastly, the original dataset contained missing
user, we remove a song from their listening history and meta-data (e.g. there were t r a c k _ i d in the events
use it as the ground truth when evaluating the models. data which did not have corresponding track</p>
        <p>
          We provide test abstractions and an evaluation script metadata). We removed tracks, albums, artists
designed for LFM, a transformed version of LFM-1b and events which had missing information.
dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] – a dataset focused on music consumption on • We summarize the final dataset statistics in Table
Last.fm. We chose the LFM-1b dataset as the primary data 1.
source after a thorough comparisons of popular datasets
for a unique combination of features. Given our focus Taken together, these features allow us to fulfill E v a l R S
on rounded evaluation and the importance of joining promise of ofering a challenging setting and a rounded
prediction / ground truth with meta-data, LFM is an ideal evaluation. While a clear motivation behind the release of
dataset, as it provides rich song (artist, album informa- LFM-1b dataset was to ofer “additional user descriptors
tion) and user (country, age, gender,3 time on platform) that reflect their music taste and consumption behavior”,
meta-data. it is telling that both the modelling and the evaluation
        </p>
        <p>We applied principled data transformations to make by the original authors are still performed without any
E v a l R S amenable to a larger audience whilst preserving real use of these rich meta-data [27]. By taking a fresh
the rich information in the original dataset. We detail look at an existing, popular dataset, E v a l R S challenges
the data transformation process and our motivations: practitioners to think about models not just along
familiar quantitative dimensions, but also along non-standard
scores closer to human perception of relevance and
fairness.
• First, we removed u s e r s and a r t i s t s which have
few interaction since they are likely to be too
sparse to be informative. Following the
suggestions in, we apply k-core [26] filtering to the bipar- 3.2. Evaluation Metrics
tite interaction graph between u s e r s and a r t i s t s ,
setting  = 10 (i.e. we retain vertices with a mini- Submission are evaluated according to our randomized
mum degree of k). loop (Section 3.3) over the testing suite released with the
• After the aforementioned processing, the dataset challenge. At a first glance, tests can be roughly divided
still contained over 900M events, which moti- in three main groups:
vated further filtering of the data. In particular,
we keep only the first interaction a u s e r had with
a given t r a c k , and for each u s e r we retain only
their  = 500 most recent unique t r a c k
interactions. We supplement the information lost
during this pruning step by providing the interaction
count between a u s e r and a t r a c k .</p>
        <p>
          • Standard RSs metrics: these are the typical
point-wise metrics used in the field (e.g. MRR,
HR@K) – they are included as sanity checks and
as a informative baseline against which insights
gained through the behavioral tests can be
interpreted.
• Standard metrics on a per-group or slice
basis: as shown for example in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], models which
are indistinguishable on the full test set may
exhibit very diferent behavior across data slices.
        </p>
        <sec id="sec-3-1-1">
          <title>3Gender in the original dataset is a binary variable. This is a limita</title>
          <p>tion, as it gives a stereotyped representation of gender. Our intent
is not to make normative claims about gender.</p>
          <p>It is therefore crucial to quantify model perfor- on each slice and the the MR obtained on the original
mance for specific input and target groups, i.e. test set is averaged and negated (so that a higher value
is there a performance diference between males implies better performance in the metric) to obtain the
and females? Is there an accuracy drops when final score for each test. The slice-based tests considered
artists are not very popular? for the final scores are:
• Behavioral tests: this group may include
perturbance tests (i.e. if we modify a user’s history
by swapping Metallica with Pantera, how much
will predictions change?), and error distance tests
(i.e. if the ground truth is Shine On You Crazy
Diamond and the prediction is Smoke on the Water,
how severe is this error?).</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Based on this taxonomy, we now survey the tests im</title>
          <p>plemented in the R e c L i s t powering E v a l R S , with
reference to relevant literature and examples from the target
datasets. For implementation details please refer to the
oficial repository. 4
3.2.1. Standard RSs metrics
Based on popular metrics in the literature, we picked two
standard metrics as a quantitative baseline and sanity
check for our R e c L i s t :
• Mean Reciprocal Rank (MRR) as a measure of
where the first relevant element retrieved by the
model is ranked in the output list. Besides
being considered a standard rank-aware evaluation
metric, we chose MRR because it is particularly
simple to compute and to interpret.
• Hit Rate (HR), defined as Recall at k ( = 100 ),
i.e. the proportion of relevant items found in the
top-k recommendation.
3.2.2. Standard metrics on a per-group or slice</p>
          <p>basis
Models are tested to address a wide spectrum of known
issues for recommender systems, for instance: fairness
(e.g. a model should have equal outcomes for diferent
groups, e.g. [28, 29, 30]), robustness (e.g. a model should
produce good outcomes also for long-tail items, such as
items with less history or belonging to less represented
categories, e.g. [31]), industry-specific use-cases (e.g. in
the case of music, your model should not consistently
penalize niche or simply less known artists).</p>
          <p>All the tests in this group are based on Miss Rate (MR),
defined as ratio between the prediction errors (i.e. model
predictions do not contain the ground truth) and the
number of predictions. Slices can be generalized asn
partitions (e.g. Countries with UK/US/IT/FR and others
is split is N partitions) of the test data forming n-ary
classes. The absolute diference between the MR obtained
4https://github.com/RecList/evalRS-CIKM-2022.
• Gender balance. This test is meant to address
fairness towards gender [32]. Since the dataset
only provides binary gender, the test will
minimize the diference between the MR obtained on
users who specified Female as gender and the MR
obtained on the original test set. In other words,
the smaller the diference, the fairer the model
towards potential gender biases.
• Artist popularity. This test is meant to address
a known problem in music recommendations:
niche (or simply less known) artists and users
who are less interested in highly popular
content are often penalized by recommender systems
[33, 34]. This point appears even more important
when we consider that several music streaming
services (e.g. Spotify, Tidal) also act as
marketplaces for artists to promote their music. Since
splitting the test set in two would draw an
arbitrary line between popular vs. unpopular artists,
failing to capture the actual properties of the
distribution. Instead, we split the test set into bins
with equal size after logarithmic scaling.
• User country. Music consumption is subject
to many country dependent factors, such as
language diferences, local sub-genres and styles,
local licensing and distribution laws, cultural
inlfuences of local traditional music, etc [ 35]. We
capture this diversity by slicing the test set based
on the top-10 countries by user counts.
• Song popularity. This test measures the model
performance on both popular tracks and on songs
with fewer listening events. The test is designed
to address both robustness to long tail items and
cold-start scenarios, so we pooled together both
less popular and newer songs. Again, we used
logarithmic bucketing with base 10 to divide the
test set in order to avoid arbitrary thresholds.
• User history. The test can be viewed as a
robustness/cold-start test, in which we sliced the
dataset based on the length of user history on the
platform. To create slices, we use the user play
counts (i.e. the sum of play counts per user) and
we use logarithmic bucketing in base 10 to divide
the test set in order to avoid arbitrary thresholds.
3.2.3. Behavioral and qualitative tests
Our final set of tests is behavioral in nature, and tries
to capture (with some assumptions) how models difer
based on qualitative aspects:
• Be less wrong. It is important that RSs maintain get participants comfortable, through harmless
a reasonable standard of relevance even when iterations, with the dataset and the multi-faceted
the predictions are not accurate. For instance, nature of the challenge.
if the ground truth for a recommendation is the 2. Second phase: after the organizers have evaluated
rap song ‘Humble’ by Kendrick Lamar, a model the score distributions for individual tests, they
might suggest another rap song from the same will attach diferent weights to each test to
proyear (‘The story of O.J.’ by Jay-Z), or a famous pop duce a balanced macro-score - i.e. if a test turns
song from the top chart of that year (‘Shape of You’ out to be easy for most participants, its
imporby Ed Sheeran). There is still a substantial difer- tance will be counter-biased in the calculation.
ence between these two as the first one is closer At the beginning of this phase, participants are
to the ground truth than the second. Since this asked to update their evaluation script by cloning
has a great impact on the overall user experience, again the data challenge repository: the purpose
it is desirable that models test and measure their for each team becomes now leveraging the
inperformance scenarios like the one just described. sights from the previous phase to optimize their
We use the latent space of tracks to compute the models as much as possible for the leaderboard.
average pairwise cosine distance between the em- Only scores obtained in this phase are considered
beddings of the predicted items and the ground for the final prizes .</p>
          <p>truths.
• Latent diversity: Diversity is closely tied with 3.3. Methodology
the maximization of marginal relevance as a way
to acknowledge uncertainty of user intent and Since the focus of the challenge is a popular public dataset,
to address user utility in terms of discovery [36]. we implemented a robust evaluation procedure to avoid
Diversity is often considered a partial proxy for data leakage and ensure fairness5. Our protocol is split in
fairness and it is an important measure of the per- two phases: local – when teams iterate on their solution
formance of recommender systems in real world during the challenge - and remote – when organizers
verscenarios [37]. We address diversity using the ify the submissions at the end and proclaim the winners:
latent space of tracks testing for model density
- where density is defined as the summation of
the diferences between each point in the
prediction space and the mean of the prediction space.</p>
          <p>Additionally, in order to account also for the
“correctness” of prediction vectors, we calculate a
bias defined as the distance between the ground
truth vector and the mean of the prediction
vector and weight to penalize for high bias: the final
score is computed as 0.3 * diversity - 0.7 * bias,
where 0.3 and 0.7 are weights that we determined
empirically to balance diversity and correctness.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Please note that since we aim at widening the commu</title>
          <p>nity contribution to testing, the final code submission for
E v a l R S includes as a requirement that participants
contribute at least one custom test, by extending the provided
abstraction.
3.2.4. Final score
Since each of the tests above return a score from a
potentially unique, non-normal distribution, we need a way to
define a macro-score for the leaderboard. To define the
formula we adopt an empirical approach in two phases:
1. First phase: scores of individual tests are simply
averaged to get the leaderboard macro-score. The
• Local evaluation protocol: For each fold, the
provided script first samples 25% of the users in the
dataset. It then partitions the dataset into training
and testing sets using the leave-one-out protocol:
the testing set comprises a list of unique users,
where the target song for each of them has been
picked randomly from their history. The
training set is the listening history for these sampled
users with their test song removed. Participants’
models will be trained and tuned based on their
custom logic on the training set, and then
evaluated over the test suite (Section 3.2) to provide a
ifnal score for each run (Section 3.2.4);
partitioning, training, testing, scoring will be done for a
total of 4 repetitions: the average of the runs will
constitute the leaderboard score.
• Remote evaluation protocol: the organizers will
run the code submitted by participants, and
repeat the random evaluation loop. The scores thus
obtained on the E v a l R S test suite will be compared
with participants submissions as a sanity check
(statistical comparison of means and 95%
bootstrapped CI).</p>
          <p>Thanks to the provided APIs, participants will be able
to run the full evaluation loop locally, as well as update
purpose of this phase is to gather data on the rela- 5To help participants with the implementation, we provide a template
tive dificulty and utility of the diferent tests, and script that can be modified with custom model code.
their leaderboard score automatically through the pro- is Adj. Professor of MLSys at NYU, publishes regularly in
vided script. To ensure a fair and reproducible remote top-tier conferences (including NAACL, ACL, RecSys,
SIevaluation, final submission should contain a docker im- GIR), and is co-organizer of SIGIR eCom. Jacopo was the
age that runs the local evaluation script and produces the lead organizer of the SIGIR Data Challenge 2021,
speardesired output within the maximum allotted time on the heading the release of the largest session-based dataset
target cloud machine. Please check E v a l R S repository for for eCommerce research.
the exact final requirements and up-to-date instructions.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.1. Structure and timeline</title>
        <sec id="sec-3-2-1">
          <title>E v a l R S unfolds in three main phases:</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Organization, Community,</title>
    </sec>
    <sec id="sec-5">
      <title>Impact</title>
      <sec id="sec-5-1">
        <title>Federico Bianchi Federico Bianchi is a postdoctoral</title>
        <p>researcher at Stanford University. He obtained his Ph.D.
in Computer Science at the University of Milano-Bicocca
in 2020. His research, ranging from Natural Language
Processing methods for textual analytics to recommender
systems for the e-commerce has been accepted to major
NLP and AI conferences (EACL, NAACL, EMNLP, ACL,</p>
        <p>AAAI, RecSys) and journals (Cognitive Science, Applied
1. CHALLENGE: An open challenge phase, where Intelligence, Semantic Web Journals). He co-organized
participating teams register for the challenge and the SIGIR Data Challenge 2021. He frequently releases his
work on improving the scores on both standard research as open-source tools that have collected almost
and behavioral metrics across the two phases ex- a thousand GitHub stars and been downloaded over 100
plained above (3.2.4). thousand times.
2. CFP: A call for papers, where teams submit a
written contribution, describing their system, custom Tobias Schnabel Tobias Schnabel is a senior
retesting, data insights. searcher in the Productivity+Intelligence group at
Mi3. CONFERENCE: At the conference, winners will crosoft Research. He is interested in improving
humanbe announced and special prizes for novel testings facing machine learning systems in an integrated way,
and oustanding student work will be awarded. considering not only algorithmic but also human
facDuring the workshop, we plan to discuss solicited tors. To this end, his research draws from causal
inpapers and host a round-table with experts on RSs ference, reinforcement learning, machine learning, HCI,
evaluation. and decision-making under uncertainty. He was a
coorganizer for a WSDM workshop this year and has served
as (senior) PC member for a wide array of AI and data
science conference (ICML, NeurIPS, WSDM, KDD).
Before joining Microsoft, he obtained Ph.D. from the
Computer Science Department at Cornell University under
Thorsten Joachims.</p>
        <p>Giuseppe Attanasio Giuseppe Attanasio is a
postdoctoral researcher at Bocconi, where he works on
largescale neural architectures for Natural Language
Processing. His research focuses on understanding and
regularizing models for debiasing and fairness purposes. His
research on the topic has been accepted to major NLP
conferences (ACL). While working at Bocconi, he is
concluding his Ph.D. at the Department of Control and
Computer Engineering at Politecnico di Torino.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Ciro Greco Ciro Greco was the co-founder and CEO of</title>
        <p>Tooso, a San Francisco based startup specialized in
Information Retrieval. Tooso was acquired in 2019 by Coveo,
where he now works as VP or Artificial Intelligence.He
holds a Ph.D. in Linguistics and Cognitive Neuroscience
at Milano-Bicocca. He worked as visiting scholar at MIT
and as a post-doctoral fellow at Ghent University. He
published extensively in top-tier conferences (including</p>
      </sec>
      <sec id="sec-5-3">
        <title>Our CFP takes a “design paper” perspective, where</title>
        <p>teams are invited to discuss both how they adapted their
initial model to take into account the test suite, and how
the tests strengthened their understanding of the target
dataset and use case6.</p>
        <p>We emphasize the CFP and CONFERENCE steps as
moments to share with the community additional tests, error
analysis and data insights inspired by E v a l R S . By
leveraging RecList, we not only enable teams to quickly iterate
starting from our ideas, but we promise to immediately
circulate in the community their testing contribution
through a popular open source package. Finally, we plan
on using CEUR-WS to publish the accepted papers, as
well as drafting a final public report as an additional,
actionable artifacts from the challenge.</p>
        <sec id="sec-5-3-1">
          <title>4.2. Organizers</title>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>Jacopo Tagliabue Jacopo Tagliabue was co-founder</title>
        <p>of Tooso, an Information Retrieval company acquired by
Coveo in 2019. As Director of AI at Coveo, he divides his
time between product, research, and evangelization: he</p>
      </sec>
      <sec id="sec-5-5">
        <title>6As customary in these events, we will involve a small committee</title>
        <p>from top-tier practitioners and scholars to ensure the quality of the
ifnal submissions.</p>
        <p>NAACL, ACL, RecSys, SIGIR) and scientific journals (The the evaluation of RSs and fairness; second, researchers
Linguistic Review, Cognitive Science, Nature Commu- who proposed a new model and desire to test its
genernications). He was also co-organizer of the SIGIR Data alization abilities on new metrics; third, industrial
pracChallenge 2021. titioners that started using R e c L i s t after its release in
recent months, and already signaled strong support for
Gabriel de Souza P. Moreira Gabriel Moreira is a Sr. behavioral testing in their real-world use cases.
Applied Research Scientist at NVIDIA, leading the re- E v a l R S makes a novel and significant contribution to
search eforts of Merlin research team. He had his PhD the community: first, we ask practitioners to “live and
degree from ITA university, Brazil, with a focus on Deep breath” the problem of evaluation, operationalizing
prinLearning for RecSys and Session-based recommendation. ciples and insights through sharable code; second, we
Before joining NVIDIA, he was lead Data Scientist at embrace a “build in the open” approach, as all artifacts
CI&amp;T for 5 years, after working as software engineer for from the event will be available to the community as
more than a decade. In 2019, he was recognized as a a permanent contribution, in the form of open source
Google Developer Expert (GDE) for Machine Learning. code, design papers, and public documentation – through
He was part of the NVIDIA teams that won recent Rec- prizes assigned based on scores, but also outstanding
Sys competitions: ACM RecSys Challenge 2020, WSDM testing and paper contributions, and special awards for
WebTour Workshop Challenge 2021 by Booking.com and students, we hope to actively encourage more
practitionthe SIGIR eCommerce Workshop Data Challenge 2021 ers to join the evaluation debate and get a more diverse
by Coveo. set of perspectives for our workshop.
As argued throughout this paper, when comparing
Patrick John Chia Patrick John Chia is an Applied E v a l R S methodology to typical data challenges, we can
Scientist at Coveo. Prior to this, he completed his Mas- summarize three important diferentiating factors: first ,
ter’s degree at Imperial College London and spent a year we fight public leaderboard overfitting through our
ranat Massachusetts Institute of Technology (MIT). He was domized evaluation loop; second, we discourage complex
co-organizer of the 2021 SIGIR Data Challenge and has solutions that cannot be practically used, as our open
been a speaker on topics at the intersection of Machine source code competition provides a fixed (and
reasonLearning and eCommerce (SIGIR eCom, ECNLP at ACL). able) compute budget; third and most importantly, with
His latest interests lie in developing AI that has the ability a thorough evaluation with per-group and behavioral
to learn like infants and applying it to creating solutions tests, we encourage participants to seek non-standard
at Coveo. performance and discuss fairness implications.</p>
        <p>We strongly believe these points will lay down the
foundation for a first-of-its-kind automatic, shared,
identifiable evaluation standard for RSs.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Similar Events and Broader</title>
    </sec>
    <sec id="sec-7">
      <title>Outlook</title>
      <p>The CIKM-related community has shown great interest
in themes at the intersection of aligning machine
learning with human judgment, rigorous evaluation settings,
and fairness, as witnessed by popular Data Challenges
and important workshops in top-tier venues. Among
recent challenges, the 2021 SIGIR-Ecom Data Challenge,
the 2021 Booking Data Challenge, and the 2020 RecSys
Challenge are all events centered around the evaluation
of RSs, yet still substantially diferent: for example, the
SIGIR Challenge focused on MRR as a success metric [10],
while the Booking Challenge [38] used top-k accuracy.</p>
      <p>Moreover, the growing interest for rounded evaluation
led to the creation of many interesting workshops in
recent years, such as IntRS: Joint Workshop on Interfaces and
Human Decision Making for Recommender Systems,
ImpactRS: Workshop on the Impact of Recommender Systems
and FAccTRec: Workshop on Responsible Recommendation.
For this reason, we expect this challenge to attract a
diverse set of practitioners: first, researchers interested in</p>
    </sec>
    <sec id="sec-8">
      <title>6. ACKNOWLEDGEMENTS</title>
      <sec id="sec-8-1">
        <title>R e c L i s t is an open source library whose development is</title>
        <p>supported by forward looking companies in the machine
learning community: the organizers wish to thank Comet,
Neptune, Gantry for their generous support.7</p>
      </sec>
      <sec id="sec-8-2">
        <title>7Please check the project website for more details: https://reclist.io/.</title>
        <p>Computing Machinery, New York, NY, USA, 2016, [11] F. M. Harper, J. A. Konstan, The movielens datasets:
p. 103–110. URL: https://doi.org/10.1145/2911996. History and context, ACM Trans. Interact. Intell.
2912004. doi:1 0 . 1 1 4 5 / 2 9 1 1 9 9 6 . 2 9 1 2 0 0 4 . Syst. 5 (2015). URL: https://doi.org/10.1145/2827872.
[3] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we doi:1 0 . 1 1 4 5 / 2 8 2 7 8 7 2 .</p>
        <p>really making much progress? a worrying analy- [12] H. Zamani, M. Schedl, P. Lamere, C.-W. Chen, An
sis of recent neural recommendation approaches, analysis of approaches taken in the acm recsys
chalin: Proceedings of the 13th ACM Conference on lenge 2018 for automatic music playlist
continuaRecommender Systems, RecSys ’19, Association for tion, ACM Trans. Intell. Syst. Technol. 10 (2019).
Computing Machinery, New York, NY, USA, 2019, URL: https://doi.org/10.1145/3344257. doi:1 0 . 1 1 4 5 /
p. 101–109. URL: https://doi.org/10.1145/3298689. 3 3 4 4 2 5 7 .</p>
        <p>3347058. doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 . 3 3 4 7 0 5 8 . [13] D. Kotkov, J. Veijalainen, S. Wang, Challenges of
[4] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, serendipity in recommender systems, in: WEBIST,
C. Geng, Are we evaluating rigorously? bench- 2016.
marking recommendation for reproducible evalua- [14] D. Jannach, M. Ludewig, When recurrent neural
tion and fair comparison, in: Fourteenth ACM con- networks meet the neighborhood for session-based
ference on recommender systems, 2020, pp. 23–32. recommendation, in: Proceedings of the Eleventh
[5] X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neu- ACM Conference on Recommender Systems, 2017,
ral graph collaborative filtering, in: Proceedings of pp. 306–310.
the 42nd international ACM SIGIR conference on [15] M. Ludewig, D. Jannach, Evaluation of
sessionResearch and development in Information Retrieval, based recommendation algorithms, User Modeling
2019, pp. 165–174. and User-Adapted Interaction 28 (2018) 331–390.
[6] A. Rashed, S. Jawed, L. Schmidt-Thieme, [16] M. T. Ribeiro, T. S. Wu, C. Guestrin, S. Singh,
BeA. Hintsches, Multirec: A multi-relational yond accuracy: Behavioral testing of nlp models
approach for unique item recommendation in with checklist, in: ACL, 2020.
auction systems, Fourteenth ACM Conference on [17] G. d. S. P. Moreira, S. Rabhi, J. M. Lee, R. Ak,
Recommender Systems (2020). E. Oldridge, Transformers4rec: Bridging the gap
[7] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Lib- between nlp and sequential/session-based
recomerty, K. Al Jadda, From the lab to production: A mendation, in: Fifteenth ACM Conference on
Reccase study of session-based recommendations in the ommender Systems, 2021, pp. 143–153.
home-improvement domain, in: Fourteenth ACM [18] K. Ariu, N. Ryu, S. Yun, A. Proutière,
ReConference on Recommender Systems, RecSys gret in online recommendation systems, ArXiv
’20, Association for Computing Machinery, New abs/2010.12363 (2020).</p>
        <p>York, NY, USA, 2020, p. 140–149. URL: https://doi. [19] J. Tagliabue, B. Yu, F. Bianchi, The Embeddings
org/10.1145/3383313.3412235. doi:1 0 . 1 1 4 5 / 3 3 8 3 3 1 3 . That Came in From the Cold: Improving Vectors
3 4 1 2 2 3 5 . for New and Rare Products with Content-Based
[8] T. Moins, D. Aloise, S. J. Blanchard, Recseats: Inference, Association for Computing Machinery,
A hybrid convolutional neural network choice New York, NY, USA, 2020, p. 577–578. URL: https:
model for seat recommendations at reserved seat- //doi.org/10.1145/3383313.3411477.
ing venues, in: Fourteenth ACM Conference on [20] L. Briand, G. Salha-Galvan, W. Bendada, M.
MorRecommender Systems, RecSys ’20, Association for lon, V.-A. Tran, A semi-personalized system for
Computing Machinery, New York, NY, USA, 2020, user cold start recommendation on music
streamp. 309–317. URL: https://doi.org/10.1145/3383313. ing apps, 2020. URL: arXiv:2106.03819.
3412263. doi:1 0 . 1 1 4 5 / 3 3 8 3 3 1 3 . 3 4 1 2 2 6 3 . [21] M. Hendriksen, E. Kuiper, P. Nauts, S. Schelter,
[9] F. Bianchi, J. Tagliabue, B. Yu, Query2Prod2Vec: M. de Rijke, Analyzing and predicting purchase
Grounded word embeddings for eCommerce, in: intent in e-commerce: Anonymous vs. identified
Proceedings of the 2021 Conference of the North customers, 2020. URL: https://arxiv.org/abs/2012.
American Chapter of the Association for Computa- 08777.
tional Linguistics: Human Language Technologies: [22] Krista Garcia, The impact of product
recommenIndustry Papers, Association for Computational dations, 2018. URL: https://www.emarketer.com/
Linguistics, Online, 2021, pp. 154–162. URL: https: content/the-impact-of-product-recommendations.
//aclanthology.org/2021.naacl-industry.20. doi:1 0 . [23] A. Flexer, D. Schnitzer, J. Schlueter, A mirex
meta1 8 6 5 3 / v 1 / 2 0 2 1 . n a a c l - i n d u s t r y . 2 0 . analysis of hubness in audio music similarity, 2012.
[10] J. Tagliabue, C. Greco, J.-F. Roy, F. Bianchi, G. Cas- [24] M. Twohey, G. J. Dance, Lawmakers press amazon
sani, B. Yu, P. J. Chia, Sigir 2021 e-commerce work- on sales of chemical used in suicides, 2022. URL:
shop data challenge, in: SIGIR eCom 2021, 2021. https://www.nytimes.com/2022/02/04/technology/
amazon-suicide-poison-preservative.html.
[25] J. Tagliabue, You Do Not Need a Bigger Boat:
Recommendations at Reasonable Scale in a (Mostly)
Serverless and Open Stack, Association for
Computing Machinery, New York, NY, USA, 2021,
p. 598–600. URL: https://doi.org/10.1145/3460231.</p>
        <p>3474604.
[26] V. Batagelj, M. Zaveršnik, Generalized cores,
Advances in Data Analysis and Classification 5 (2011)
129–145.
[27] M. Schedl, Investigating country-specific music
preferences and music recommendation algorithms
with the lfm-1b dataset, International Journal of</p>
        <p>Multimedia Information Retrieval 6 (2017) 71 – 84.
[28] J. S. Ke Yang, Measuring fairness in ranked
outputs, in: SSDBM 2017: Proceedings of the 29th
International Conference on Scientific and
Statistical Database Management, 2017, pp. 1 – 6. URL:
https://doi.org/10.1145/3085504.3085526.
[29] C. Castillo, Fairness and transparency in
ranking, in: ACM SIGIR ForumVolume, volume Volume
52, 2019, pp. 64 – 71. URL: https://doi.org/10.1145/
3308774.3308783.
[30] M. Zehlike, K. Yang, J. Stoyanovich, Fairness in
ranking: A survey, in: TBD. ACM, 2020, pp. 1–58.</p>
        <p>URL: https://arxiv.org/pdf/2103.14000.pdf.
[31] M. O’Mahony, N. Hurley, N. Kushmerick, G.
Silvestre, Collaborative recommendation: A
robustness analysis, volume 4, 2004. URL: https://doi.org/
10.1145/1031114.1031116.
[32] S. Saxena, S. Jain, Exploring and mitigating gender
bias in recommender systems with explicit feedback
(2021). URL: arXivpreprintarXiv:2112.02530.
[33] D. Kowald, M. Schedl, E. Lex, The unfairness of
popularity bias in music recommendation: A
reproducibility study, European conference on
information retrieval (2020).
[34] Òscar Celma, P. Cano, From hits to niches? or how
popular artists can bias music recommendation
and discovery, in: Proceedings of the 2nd KDD
Workshop on Large-Scale Recommender Systems
and the Netflix Prize Competition, 2008. URL:
https://citeseerx.ist.psu.edu/viewdoc/download?
doi=10.1.1.168.5009&amp;rep=rep1&amp;type=pdf.
[35] P. Bello, D. Garcia, Cultural divergence in popular
music: the increasing diversity of music
consumption on spotify across countries, Humanities and</p>
        <p>Social Sciences Communications 8 (2021).
[36] M. Drosou, H. Jagadish, E. Pitoura, J. Stoyanovich,</p>
        <p>Diversity in big data: A review, Big data 5.2 (2017)
73–84.
[37] Diversity in recommender systems – a survey,</p>
        <p>Knowledge-Based Systems (2017) 154–162.
[38] M. Baigorria Alonso, Data augmentation
using many-to-many rnns for session-aware
recommender systems, in: ACM WSDM Workshop on
Web Tourism (WSDM WebTour’21), 2021.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Chia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <surname>Beyond</surname>
            <given-names>NDCG</given-names>
          </string-name>
          :
          <article-title>behavioral testing of recommender systems with reclist</article-title>
          ,
          <source>CoRR abs/2111</source>
          .09963 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/ 2111.09963.
          <article-title>a r X i v : 2 1 1 1 . 0 9 9 6 3</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <article-title>The lfm-1b dataset for music retrieval and recommendation</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval</source>
          , ICMR '16, Association for
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>