Should Algorithm Evaluation Extend to Testing? We Think So.

Lien Michiels¹,²,†, Robin Verachtert¹,²,†, Kim Falk³,† and Bart Goethals¹,²,⁴

¹ Froomle N.V., Belgium
² University of Antwerp, Antwerp, Belgium
³ Shopify, Canada
⁴ Monash University, Melbourne, Australia

Abstract
Software engineers test virtually all of their code through unit, regression and integration tests. In contrast, data scientists and machine learning engineers often evaluate models based solely on their training or evaluation loss and task performance metrics such as accuracy, precision or recall. When ‘code’ becomes ‘algorithms’, software best practices are often neglected. In our research, we found that most publicly available algorithm implementations are indeed not tested beyond ranking performance metrics, such as recall and normalized discounted cumulative gain. Applying software testing best practices to algorithms can seem daunting (and unnecessary). However, software packages like scikit-learn and SpaCy have demonstrated that it is definitely possible to test (at least some aspects of) algorithms. We believe that algorithms should be tested. Without tests, you may just end up with dead code paths, gradients that do not update, or logical errors you failed to detect. The question then becomes: how should we test algorithms? During the workshop, we would like to open up this discussion. We start with an overview of software testing paradigms: from black-box to white-box testing, unit to regression testing, and more. We then present some examples of testing patterns we have applied to our recommendation algorithm implementations. At the end of the discussion, we hope to have answered some of the following questions: (1) Should recommendation algorithms be tested? (2) What aspects of a recommendation algorithm would benefit most from testing? (3) How can we translate these software testing paradigms to recommendation algorithms? (4) What sorts of tests can we design? (5) Which of these tests should be part of a researcher’s standard experimentation pipeline? We plan to summarize the conclusions of this discussion in a future publication, accompanied by a testing toolkit.

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2022), September 22nd, 2022, co-located with the 16th ACM Conference on Recommender Systems, Seattle, WA, USA.
† These authors contributed equally.
Email: lien.michiels@froomle.com (L. Michiels); robin.verachtert@froomle.com (R. Verachtert); Kim.falk.jorgensen@gmail.com (K. Falk); bart.goethals@uantwerpen.be (B. Goethals)
ORCID: 0000-0003-0152-2460 (L. Michiels); 0000-0003-0345-7770 (R. Verachtert); 0000-0002-3573-9257 (K. Falk); 0000-0001-9327-9554 (B. Goethals)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
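To make the kind of test alluded to in the abstract concrete, below is a minimal, self-contained sketch of a “gradients actually update” sanity check for a toy matrix-factorization recommender, written as a pytest-style test on top of PyTorch. The model (ToyMF) and the test function are hypothetical illustrations introduced here for discussion; they are not taken from the authors’ implementations or the planned testing toolkit.

# Hypothetical sketch: a "gradients actually update" sanity test for a
# toy matrix-factorization recommender. ToyMF and the test name are
# illustrative only, not part of any specific library or the paper.
import torch


class ToyMF(torch.nn.Module):
    """Minimal matrix factorization: score(u, i) = <user_emb[u], item_emb[i]>."""

    def __init__(self, n_users: int, n_items: int, dim: int = 8):
        super().__init__()
        self.user_emb = torch.nn.Embedding(n_users, dim)
        self.item_emb = torch.nn.Embedding(n_items, dim)

    def forward(self, users, items):
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=-1)


def test_parameters_update_after_one_step():
    torch.manual_seed(0)
    model = ToyMF(n_users=5, n_items=7)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    users = torch.tensor([0, 1, 2])
    items = torch.tensor([3, 4, 5])
    labels = torch.tensor([1.0, 0.0, 1.0])

    # Snapshot all parameters before a single training step.
    before = [p.detach().clone() for p in model.parameters()]

    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        model(users, items), labels
    )
    loss.backward()
    optimizer.step()

    # Every trainable parameter tensor should have received a non-zero update;
    # a parameter that never changes hints at a dead code path or a detached graph.
    for p_before, p_after in zip(before, model.parameters()):
        assert not torch.allclose(p_before, p_after)

A check like this runs in milliseconds on synthetic data and can catch frozen parameters or detached computation graphs long before any ranking metric would reveal a problem.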