<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM Conference on Recommender Systems, Amsterdam, The Netherlands</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ngozi Ihemelandu</string-name>
          <email>ngoziihemelandu@u.boisestate.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael D. Ekstrand</string-name>
          <email>michaelekstrand@boisestate.edu</email>
          <uri>https://md.ekstrandom.net/</uri>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>This paper calls attention to a missing component of the recommender system evaluation process: statistical inference. There is active research on several components of the recommender system evaluation process: selecting baselines, standardizing benchmarks, and target item sampling. However, there has not yet been significant work on the role and use of statistical inference for analyzing recommender system evaluation results. In this paper, we argue that the use of statistical inference is a key component of the evaluation process that has not been given sufficient attention. We support this argument with a systematic review of recent RecSys papers to understand how statistical inference is currently being used, along with a brief survey of studies that have been done on the use of statistical inference in the information retrieval community. We present several challenges that exist for inference in recommendation experiments, which underscore the need for empirical studies to aid with appropriately selecting and applying statistical inference techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation</kwd>
        <kwd>statistical inference</kwd>
        <kwd>significance tests</kwd>
        <kwd>significant results</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        It is widely recognized that appropriate statistical inference techniques should be
used to analyze, interpret, and report the results of evaluations and experiments, including
evaluations of recommender systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These techniques come in many forms, including
point estimation, interval estimation, and hypothesis testing, but analysis needs to go beyond
merely computing metrics to determine whether observed metrics represent genuine effects. In
this paper we consider the state of statistical inference in recommender systems evaluation,
arguing that identifying and documenting best practices for statistical analysis is a vital and
oft-overlooked component of the discussion on how to improve the rigor, reproducibility, and
reliability of recommender systems evaluation results.
https://piret.info/pubs/2021/
      </p>
      <p>We focus primarily on statistical inference for one of the most common goals of recommender
systems research: to demonstrate an improvement in effectiveness over the current state of the
art. This could be by developing a new recommendation technique that is more effective at
some recommendation tasks than previously-known techniques, or by modifying an existing
approach. To assess if the measured improvement of the new method over the state-of-the-art
is substantial and not just a result of random chance, we typically use a hypothesis test (null
hypothesis significance testing, or NHST) for the null hypothesis that there is no difference
between the two methods’ effectiveness; sometimes confidence intervals or Bayesian inference
techniques may be employed instead of or in addition to an NHST.</p>
      <p>
        There has been significant research on evaluation strategies for this research goal. Dacrema
et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] showed in their systematic analysis of deep learning approaches for top-n
recommendation tasks that many claims of improved performance over a baseline may be illusory. There
are many design points in a recommender experiment that can affect its rigor and reliability;
Dacrema et al. focused specifically on the choice and tuning of baselines in the evaluation
process. They found that many measured improvements disappear when the baseline algorithms
are properly tuned: that is, better choice of hyperparameters and model options can cause the
baseline to perform just as well as the proposed new method.
      </p>
      <p>
        Other authors have considered the effects of, and sought to develop best practices for, other
design choices in an evaluation. Rendle et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] argue for standardized benchmarks, by which
they mean datasets with well-defined train–test splits and evaluation protocols for specific
tasks (e.g. prediction). They state that although well-defined benchmarks exist for comparing
prediction algorithms, there are not standardized benchmarks for other tasks such as ranking.
They argue that empirical findings reported in research papers are questionable unless they
were obtained on standardized benchmarks where — as recommended by Dacrema et al. —
baselines have been tuned extensively. Cañamares and Castells [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] bring attention to a component of the offline
evaluation setup, target item sampling, that is not always explicit and has
received little attention in the search for a sound evaluation procedure. They show that
different target subsets can lead to different evaluation outcomes. Sun et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] shed light
on two issues, unreproducible evaluation and unfair comparison, which they attribute to
the unavailability of effective benchmarks for evaluation. They investigated evaluation
rigor (reproducibility and fairness) in recommendation by analyzing the influence of
different factors on recommendation performance through a holistic empirical study. The results
of their study corroborate the findings of Dacrema et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>However, there has not yet been much attention to appropriately selecting and applying
statistical inference techniques to the metrics that result from these evaluations. Shani and
Gunawardana [6] discuss general ways of performing significance testing using widely-known
statistical methods, but to our knowledge there have not yet been empirical studies on the
use of statistical inference for analyzing recommender system evaluation results, as there have been for TREC-style
search experiments (see Section 3). Evidence-based guidance on best practices for analyzing
and reporting results is therefore lacking. The current use, or lack thereof, of various techniques
for recommender system experimental results is also an open question.</p>
      <p>Our central claim in this paper is that the RecSys community does not currently pay sufficient
attention to the choice and use of statistical techniques, and discussions such as the one at
this workshop need to consider the role of inference and develop best practices for rigorous
analysis of evaluation results. We support this argument with a systematic review of recent
RecSys papers to understand how statistical inference is currently being used, along with a
brief survey of studies that have been done on the use of statistical inference in the information
retrieval community (particularly for analyzing TREC search effectiveness metrics). We identify
several challenges that exist for inference in recommendation experiments, and call on the
community to attend to this issue and work with us to fill this important gap in the literature
on reliable evaluation of recommender systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Systematic Review of Statistical Inference in RecSys</title>
      <p>We begin by assessing current practices in statistical inference for recommender system
evaluations. Our study is inspired by that of Sakai [7], who conducted a systematic review of 840 SIGIR
full papers and 215 TOIS papers published between 2006 and 2015. Their goal was to identify
what types of statistical test IR researchers use, how they report or fail to report on significance
test results, and how the reporting practices may have changed over the last decade.</p>
      <p>They found that, of the 862 papers selected for the survey, about 28-30% do not report
significance test results; for the comparison of two IR systems, 61-66% of these papers use the paired
t-test; 20-23% use the Wilcoxon signed rank test; 4-5% use the randomisation test; 3-4% use
the sign test; and 1% use the bootstrap test. They also found that the paired t-test was more
common in recent years while the Wilcoxon test decreased in popularity.</p>
      <p>To get a first look at current RecSys statistical practices, we conducted a systematic review
restricted to long and short RecSys papers that proposed new or enhanced algorithmic methods and
compared their performance to that of baselines (the state of the art). Hence, proposed new methods
that were not compared to baselines were not selected. Our survey is limited to papers published
in 2019 and 2020.</p>
      <sec id="sec-2-1">
        <title>2.1. Survey Methods</title>
        <p>The main focus of this systematic survey is to examine how statistical significance tests are
used by researchers working on papers proposing new or enhanced recommender algorithms.</p>
        <p>We selected full and short papers from RecSys 2019–2020 that meet the following criteria:</p>
        <list list-type="bullet">
          <list-item><p>The paper proposed a new or enhanced algorithmic method for some recommendation task.</p></list-item>
          <list-item><p>The effectiveness scores for the baselines and the new or enhanced method were recorded.</p></list-item>
        </list>
        <p>We coded the selected papers as specified below. The coding was done in the listed order: if a paper does not meet the first criterion, the second criterion is checked, and so on.</p>
        <def-list>
          <def-item><term>Used specified test</term><def><p>The paper mentioned the name of the test used along with the significance level (α) or p-value. We also recorded which test it used.</p></def></def-item>
          <def-item><term>Used confidence interval</term><def><p>The paper reported confidence intervals or indicated the standard error for the estimated metric scores for the new method as well as the baseline.</p></def></def-item>
          <def-item><term>Used unspecified test</term><def><p>The paper did not specify which test was used but claimed statistical significance, or specified a p-value &lt; significance level (α) or the calculated test statistics.</p></def></def-item>
          <def-item><term>No significance test</term><def><p>The paper did not seem to test the results for significance.</p></def></def-item>
        </def-list>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Results and Discussion</title>
        <p>Out of the 146 RecSys long and short papers examined, we found 111 papers that proposed new
or enhanced recommender algorithms for which we expect significance testing to be used to
analyze the evaluation results. See Table 1 for the breakdown of the selected papers by year.</p>
        <p>Table 2 shows the classification of the 111 selected papers, and Fig. 1 shows the distribution
by the test type of the set of selected papers labeled as “used significance tests”. We found that
over half of the papers proposing a new algorithmic method did not seem to use any significance
test to analyze their evaluation results; a substantial portion of those who claim significance did
not specify a test.</p>
        <p>These results show that there is currently a lack of rigorous statistical analysis and reporting
in the evaluations published in RecSys. While we do not have an explanation as to why there is
this gap, we believe it needs to be filled if we are to go from observed differences in metrics to
reliable knowledge.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Statistical Inference in Information Retrieval</title>
      <p>In addition to Sakai’s study of existing practice, several studies in the information retrieval (IR)
community have addressed the use of statistical inference for system comparison experiments,
attempting to identify which statistical techniques are appropriate for analyzing
the evaluation results of IR system comparisons, particularly the results of TREC-style
experiments.</p>
      <p>Smucker et al. [8] used results from historical TREC runs to study the agreement between
different pairwise significance tests. Using root mean squared error (RMSE) to compare the
p-values produced by five different tests, they found that the randomization, bootstrap, and
t-tests all agreed with each other (producing very similar p-values) while the Wilcoxon and
sign tests agreed neither with the other tests nor with each other. They then used the randomization
test as ground truth to estimate the false positive and false negative rates of the Wilcoxon and
sign tests, finding that both tests have high false positive and false negative rates when the
difference in system effectiveness (evaluation metric) is small. They recommend that researchers
wanting a distribution-free test should use the randomization test with the test statistic of their
choice, and recommend discontinuing use of the Wilcoxon and sign tests for IR evaluation data
analysis.</p>
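      <p>As a rough illustration of the kind of randomization test recommended by Smucker et al., the following Python sketch applies a paired sign-flip randomization test to per-topic (or per-user) metric scores from two systems. It is a minimal sketch under our own assumptions, not the procedure used in that study: the function name paired_randomization_test, the choice of the mean difference as the test statistic, the number of resamples, and the fixed seed are all illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from statistics import fmean


def paired_randomization_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test on the mean difference.

    scores_a and scores_b are equal-length lists holding one metric value per
    topic or user for each system (hypothetical inputs for illustration).
    Returns a Monte Carlo estimate of the p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(fmean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two system labels are exchangeable
        # within each pair, so each difference keeps or flips its sign with
        # equal probability.
        permuted = fmean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)
```
]]></preformat>
      <p>Calling paired_randomization_test with two equal-length lists of per-topic scores returns an estimated two-sided p-value. Because the test only relies on exchangeability within pairs, another statistic (for example the median difference) could be substituted for the mean without changing the overall structure.</p>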
      <p>Urbano et al. [9] and Parapar et al. [10] used simulations to produce per-topic evaluation
scores rather than directly using the recorded metrics. They fit generative probabilistic models
to the metric distributions from historic TREC runs (to ensure realism) and sampled from these
models, allowing them to directly control the actual difference (or lack thereof) between systems
and measure the error rates of different statistical tests. One of the key differences in their
approaches is the simulation architecture: Parapar et al. [10] simulated the utility of individual
retrieved documents, while Urbano et al. [9] modeled the joint distribution between pairs of
effectiveness scores. Both simulation designs enabled them to directly assess the accuracy of
the p-values produced by the various significance tests, and to measure their false positive rates
and statistical power.</p>
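      <p>The logic of measuring a test's error rate by simulation can be sketched in a few lines of Python. The example below is a toy version under strong assumptions: instead of generative models fit to TREC runs, it draws per-topic score differences from a zero-mean Gaussian, so the null hypothesis is true by construction, and counts how often an exact two-sided sign test rejects at level α. The names sign_test_p_value and false_positive_rate, the Gaussian model, and all parameter values are illustrative assumptions and are not taken from the cited studies.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from math import comb


def sign_test_p_value(diffs):
    """Exact two-sided sign test on paired differences (zeros are dropped)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def false_positive_rate(test, n_topics=50, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated null experiments in which `test` rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Null scenario (assumed for illustration): per-topic differences are
        # symmetric around zero, so every rejection is a false positive.
        diffs = [rng.gauss(0.0, 0.1) for _ in range(n_topics)]
        if test(diffs) < alpha:
            rejections += 1
    return rejections / n_trials
```
]]></preformat>
      <p>Calling false_positive_rate(sign_test_p_value) estimates the Type I error rate of the sign test in this toy setting. Passing any other function that maps a list of differences to a p-value allows several tests to be compared on the same simulated data; shifting the mean of the simulated differences away from zero would turn the same loop into a power estimate.</p>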
      <p>Urbano et al. [9] found that the Wilcoxon and sign tests have more false positives than
expected, especially at low significance levels, and that this error is more pronounced as the
sample size increases. The bootstrap test exhibits similar behavior (producing more false positives
than expected) with small sample sizes but starts behaving as expected as the sample size
increases. The randomization test behaves better than the bootstrap, Wilcoxon and sign tests and
approaches the expected behavior as the sample size increases. The t-test behaves as expected
even for small sample sizes. They also found that for large sample sizes the randomization,
bootstrap and t-tests all agree, concurring with the results of Smucker et al. [8].</p>
      <p>They also found that the sign test is consistently less powerful than other tests while the
bootstrap test is usually the most powerful, especially with small samples. With large sample
sizes, all tests except the sign test exhibited nearly-identical power. Since the t-test was
well-behaved in terms of both the false positive rate (even for small samples) and power, Urbano
et al. recommend its use as the best choice for mean effectiveness in IR evaluations, and the
randomization test for test statistics other than the mean. Like Smucker et al., they discourage
use of the Wilcoxon and sign tests for IR evaluation results.</p>
      <p>Parapar et al. [10] came to different conclusions than Urbano et al. Their simulations showed
that the Wilcoxon and randomization tests have the expected false-positive rate behavior while
the t-test, the sign test, and the bootstrap did not behave as expected. They also found that the
sign test and Wilcoxon test have more statistical power than the other tests. Therefore they
recommend the use of the sign test and Wilcoxon test for the analysis of IR evaluation results.</p>
      <p>All three papers had the goal of producing recommendations for appropriate significance
tests to apply when comparing IR systems. Smucker et al. [8] and Urbano et al. [9] made
recommendations that were similar, while Parapar et al. [10] arrived at a completely different
recommendation.</p>
      <p>Both Urbano et al. [11] and Parapar et al. [12] have followed up and attempted to understand
this discrepancy in their conclusions, but there is not yet clarity on which is the more reliable
recommendation.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Gaps for Fixing RecSys Evaluation Practice</title>
      <p>Whichever evidence produces the more reliable recommendation for the IR evaluation settings
studied in the previous section, it may not be feasible to simply apply that recommendation to
RecSys evaluation. There are some key differences between TREC ad-hoc retrieval evaluation
and recommender system evaluation. These differences do not only
affect recommendation, as many are shared with actual deployments of search engines outside
the TREC context. They include:</p>
      <list list-type="bullet">
        <list-item><p>The sample size of the test collection in a traditional TREC Cranfield experiment is quite small (often 50 topics, particularly in the data sets studied), while the typical sample size of a RecSys evaluation is &gt; 1,000 users.</p></list-item>
        <list-item><p>In typical RecSys evaluation data, a few items are known to be relevant to many users, resulting in a long-tailed distribution of user ratings over items. This is in contrast with TREC evaluation, where documents are not concentrated on just a few queries.</p></list-item>
        <list-item><p>In TREC evaluation, the ground-truth relevance judgments are assumed to be (approximately) complete, while the user feedback used in RecSys evaluations forms a sparse and highly incomplete picture of item-user relevance.</p></list-item>
      </list>
      <p>We want to call particular attention to sample size, as it is a key factor that impacts the
statistical power of a significance test (the ability of the test to detect significance in the
presence of a real effect). The statistical power of a significance test increases as the sample size
increases; therefore, by increasing the sample size, any measured improvement can be found to
be significant by any significance test even when the size of the measured improvement is so
small that it is not operationally meaningful.</p>
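      <p>A small numerical sketch makes the sample-size point concrete. The Python snippet below holds a hypothetical mean improvement and standard deviation of per-user differences fixed and varies only the number of users, using a normal approximation to the paired test statistic. The function name two_sided_p_value and the numbers 0.002 and 0.1 are assumptions chosen purely for illustration, not values from any study discussed here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_diff, sd_diff, n_users):
    """Normal-approximation p-value for a paired mean difference.

    The statistic is z = mean_diff / (sd_diff / sqrt(n_users)); the normal
    approximation is an assumption that is reasonable only for large samples.
    """
    z = mean_diff / (sd_diff / sqrt(n_users))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical effect: a 0.002 gain in a per-user metric whose paired
# differences have standard deviation 0.1. Only the sample size changes.
for n_users in (50, 1000, 10000, 100000):
    print(n_users, two_sided_p_value(0.002, 0.1, n_users))
```
]]></preformat>
      <p>With these assumed numbers the same tiny improvement is far from significant at 50 units and falls below the conventional 0.05 threshold once the sample reaches about 10,000 users, which is why a p-value alone says little about whether an improvement is operationally meaningful.</p>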
      <p>
        Statistical biases are another factor that may influence the outcome of significance tests for
RecSys evaluation data. It has become well known that biases such as sparsity and popularity
biases in RecSys evaluation data considerably distort the evaluation measures [
        <xref ref-type="bibr" rid="ref4">13, 14, 15, 16, 4</xref>
        ].
Bellogín et al. [17] showed that the long-tailed distribution of RecSys evaluation data has a
drastic effect on how recommendation algorithms compare to each other. Hypothesis tests
do not account for these biases; hence, this distortion can ultimately influence their outcomes. It
is not clear whether this should be fixed as a part of inference, or as a corrective stage before or
after inference, but it remains a gap in the ability to accurately evaluate system performance
that needs to be addressed.
      </p>
      <p>There are also ongoing discussions on the inadequacies of statistical significance testing.
McShane et al. [18] state that the widespread crisis in the biomedical and social sciences, with
published findings failing to replicate at an alarming rate, may be associated with claims of
huge effects from tiny interventions, citing p &lt; 0.05 as the primary evidence. A group of
72 researchers representing a wide range of disciplines (psychology, economics, sociology,
anthropology, medicine, epidemiology, ecology, and philosophy) and statistical perspectives
have proposed a change in the p-value threshold for a “statistically significant” result from 0.05
to 0.005 for claims of discoveries of novel effects [19]. They recommend that results currently
called “statistically significant” that do not meet the new threshold be called suggestive
and treated as ambiguous as to whether there is an effect. However, McShane et al. [18] state
that this proposal is insufficient to overcome the current crisis with the inability to replicate
experiment results. They recommend abandoning the null hypothesis significance testing
paradigm entirely and using p-values as just one of many pieces of information to cite as evidence
for a novel effect claim.</p>
      <p>Translating this discussion back to information retrieval, Sakai [20] recognizes that statistical
significance testing is not enough and provides suggestions on how IR researchers should report
effect sizes and confidence intervals along with p-values, in the context of comparing IR systems
using test collections. Carterette [21] advocates for the use of the t-test even though their
analysis showed that a p-value cannot have any objective meaning. They believe p-values are still useful
for many of the purposes for which they are currently used. They recommend, however, that in the
long term, IR experimental analysis should transition to a fully Bayesian modeling approach.</p>
      <p>We raise these points to observe that even if we can identify effective and appropriate
hypothesis tests for typical offline evaluation metrics, that does not fully address the goal of
inferring whether or not a proposed system is actually more effective; additional sources of bias
need to be accounted for, and it is not clear that NHST is the best framework for evaluating
results.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Challenges of Statistical Inference and Next Steps</title>
      <p>
        It is important that statistical inference results are reported with all necessary details in order to
make research papers as informative as possible, and to give the reader confidence
in the credibility and impact of a reported improvement. However, this is not
the prevalent current practice in the RecSys community, as demonstrated by the results in
Section 2. While there has been significant attention paid to other aspects of the evaluation
process [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ], and the IR community has studied inference for certain experimental settings
(see Section 3), this aspect has not yet been a noticeable part of the scholarly discourse on
evaluation practices. We argue that this gap needs to be filled.
      </p>
      <p>As a first step, we propose that researchers should report clearly how they performed inference
on their results, with multiple results as appropriate. For example, studies using frequentist
significance testing should report the test used, the p-value threshold, any corrections for
multiple comparisons, and also the effect size and confidence interval, in order to help readers
fully understand and better apply the findings. Reporting effect size and sample size helps
to make papers as informative as possible. While further research is needed to identify best
practices for selecting and applying techniques, research using current practices should clearly
document them.</p>
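      <p>To make the reporting suggestion concrete, the following Python sketch computes several of the quantities listed above for a paired comparison: the sample size, the mean difference, a standardized effect size (the mean difference divided by the standard deviation of the paired differences), and a percentile bootstrap confidence interval for the mean difference. It is one possible sketch, not a recommended standard: the function name paired_report, the choice of effect size, the bootstrap settings, and the seed are all assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from statistics import fmean, stdev


def paired_report(scores_new, scores_base, n_boot=2000, level=0.95, seed=0):
    """Summarize a paired comparison for reporting.

    Inputs are equal-length lists of per-user metric scores (hypothetical
    data). Returns the sample size, mean difference, standardized effect
    size, and a percentile bootstrap confidence interval.
    """
    if len(scores_new) != len(scores_base):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_new, scores_base)]
    n = len(diffs)
    mean_diff = fmean(diffs)
    sd = stdev(diffs)
    # Standardized effect size; undefined when all differences are equal.
    effect_size = mean_diff / sd if sd > 0 else float("nan")
    rng = random.Random(seed)
    # Resample users with replacement and recompute the mean each time.
    boot_means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lower = boot_means[round(tail * n_boot)]
    upper = boot_means[round((1.0 - tail) * n_boot) - 1]
    return {
        "n": n,
        "mean_diff": mean_diff,
        "effect_size": effect_size,
        "ci": (lower, upper),
    }
```
]]></preformat>
      <p>The returned dictionary can be reported alongside the name of the test, the threshold, and any multiple-comparison correction. The percentile bootstrap is used here only because it needs no distributional tables; an interval based on the t distribution would be an equally reasonable choice.</p>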
      <p>
        We believe further research is needed to identify best practices for applying and reporting on
classical tests and techniques, and to study how more advanced inference techniques may be
able to mitigate some of their limitations. Advanced techniques that could be studied
in the recommender system context include mixed-effects models for testing the significance of effects
and Bayesian inference techniques for computing and summarizing posterior distributions of
effect sizes. We also believe that, as the community continues to work towards documented best
practices, having discussed in the past the need to lay out recommended methods for the
benefit of authors, reviewers, and editors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], such practices need to include recommendations
for statistical techniques. The community may be ready to make some such recommendations
now, but we call for further research to provide empirical evidence for the appropriateness of
recommended techniques, and for such guidelines to leave the door open for innovation in
statistical analysis of recommender system evaluations, at least so long as the direction of this
innovation is towards greater understanding and rigor.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the National Science Foundation under Grant IIS 17-51278.</p>
      <p>[5] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 23–32.</p>
      <p>[6] G. Shani, A. Gunawardana, Evaluating recommendation systems, in: Recommender systems handbook, Springer, 2011, pp. 257–297.</p>
      <p>[7] T. Sakai, Statistical significance, power, and sample sizes: A systematic review of SIGIR and TOIS, 2006-2015, in: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 2016, pp. 5–14.</p>
      <p>[8] M. D. Smucker, J. Allan, B. Carterette, A comparison of statistical significance tests for information retrieval evaluation, in: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007, pp. 623–632.</p>
      <p>[9] J. Urbano, H. Lima, A. Hanjalic, Statistical significance testing in information retrieval: an empirical analysis of type I, type II and type III errors, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 505–514.</p>
      <p>[10] J. Parapar, D. E. Losada, M. A. Presedo-Quindimil, A. Barreiro, Using score distributions to compare statistical significance tests for information retrieval evaluation, Journal of the Association for Information Science and Technology 71 (2020) 98–113.</p>
      <p>[11] J. Urbano, M. Corsi, A. Hanjalic, How do metric score distributions affect the type I error rate of statistical significance tests in information retrieval?, in: Conference on the Theory of Information Retrieval (ICTIR’21), 2021.</p>
      <p>[12] J. Parapar, D. E. Losada, Á. Barreiro, Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation, in: Proceedings of the 36th Annual ACM Symposium on Applied Computing, 2021, pp. 655–664.</p>
      <p>[13] M. D. Ekstrand, V. Mahant, Sturgeon and the cool kids: Problems with random decoys for top-n recommender evaluation, in: The Thirtieth International Flairs Conference, 2017.</p>
      <p>[14] M. Tian, M. D. Ekstrand, Estimating error and bias in offline evaluation results, in: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, 2020, pp. 392–396.</p>
      <p>[15] R. Cañamares, P. Castells, A probabilistic reformulation of memory-based collaborative filtering: Implications on popularity biases, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 215–224.</p>
      <p>[16] R. Cañamares, P. Castells, Should I follow the crowd? A probabilistic analysis of the effectiveness of popularity in recommender systems, in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval, 2018, pp. 415–424.</p>
      <p>[17] A. Bellogín, P. Castells, I. Cantador, Statistical biases in information retrieval metrics for recommender systems, Information Retrieval Journal 20 (2017) 606–634.</p>
      <p>[18] B. B. McShane, D. Gal, A. Gelman, C. Robert, J. L. Tackett, Abandon statistical significance, The American Statistician 73 (2019) 235–245.</p>
      <p>[19] D. J. Benjamin, J. O. Berger, M. Johannesson, B. A. Nosek, E.-J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, et al., Redefine statistical significance, Nature Human Behaviour 2 (2018) 6–10.</p>
      <p>[20] T. Sakai, Statistical reform in information retrieval?, in: ACM SIGIR Forum, volume 48, ACM New York, NY, USA, 2014, pp. 3–12.</p>
      <p>[21] B. A. Carterette, Multiple testing in statistical analysis of systems-based information retrieval experiments, ACM Transactions on Information Systems (TOIS) 30 (2012) 1–34.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , G. Adomavicius,
          <article-title>Toward identification and adoption of best practices in algorithmic recommender systems research</article-title>
          ,
          <source>in: Proceedings of the international workshop on Reproducibility and replication in recommender systems evaluation</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Are we really making much progress? a worrying analysis of recent neural recommendation approaches</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Conference on Recommender Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Koren,
          <article-title>On the difficulty of evaluating baselines: A study on recommender systems</article-title>
          , arXiv preprint arXiv:1905.01395 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cañamares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <article-title>On target item sampling in offline recommender system evaluation</article-title>
          ,
          <source>in: Fourteenth ACM Conference on Recommender Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Geng, Are we evaluating rigorously?
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>