<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reproducible Experiments in the</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vito Walter Anelli</string-name>
          <email>vitowalter.anelli@poliba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Bellogín</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Ferrara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Malitesta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Antonio Merra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Pomo</string-name>
          <email>claudio.pomo@poliba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Maria Donini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Di Sciascio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Di Noia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <kwd-group>
          <kwd>Recommender Systems</kwd>
          <kwd>Reproducibility</kwd>
          <kwd>Adversarial Learning</kwd>
          <kwd>Visual Recommenders</kwd>
          <kwd>Knowledge Graphs</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Bari</institution>
          ,
          <addr-line>via Orabona, 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Autónoma de Madrid</institution>
          ,
          <addr-line>Ciudad Universitaria de Cantoblanco, 28049 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi della Tuscia</institution>
          ,
          <addr-line>via Santa Maria in Gradi, 4, 01100 Viterbo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Recommender Systems have been shown to be an effective way to alleviate the over-choice problem and provide accurate and tailored recommendations. However, the impressive number of proposed recommendation algorithms, splitting strategies, evaluation protocols, metrics, and tasks has made rigorous experimental evaluation particularly challenging. ELLIOT is a comprehensive recommendation framework that aims to run and reproduce an entire experimental pipeline by processing a simple configuration file. The framework loads, filters, and splits the data considering a vast set of strategies. Then, it optimizes hyperparameters for several recommendation algorithms, selects the best models, compares them with the baselines, computes metrics spanning from accuracy to beyond-accuracy, bias, and fairness, and conducts statistical analysis. The aim is to provide researchers with a tool to ease all the experimental evaluation phases (and make them reproducible), from data reading to results collection. ELLIOT is freely available on GitHub at https://github.com/sisinflab/elliot.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <p>
Recommender Systems (RSs) have risen to prominence over the past decade as the go-to option
for personalized decision-support systems. Recommendation is a retrieval task in which a catalog
of products is scored, and the highest-scoring items are shown to the user. Both academia and
industry have focused their attention on RSs, as they have proven able to supply customized goods
to users. This collaborative effort yielded a diverse set of recommendation algorithms, spanning
from memory-based to latent factor-based and deep learning-based approaches. However, the
RSs community has become increasingly aware that adequately evaluating a model is not limited
to measuring accuracy metrics alone. Another aspect that has attracted much attention concerns
the evaluation of these models. While the importance of beyond-accuracy metrics is widely
recognized, an additional effort is needed to compare models rigorously and fairly with each other in
order to justify why one model performs differently from another. The problem of reproducing
the experiments recurs whenever the whole set of experiments must be recomputed. Whether for
a new experiment or not, this opens the door to another class of problems: the number of possible
design choices often forces the researcher to define and implement only the chosen (usually
limited) experimental setting. As highlighted in Konstan and Adomavicius [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], RS assessment is
an essential and developing research issue connected to reproducibility, which is a cornerstone
of the scientific process. Recently, academics have taken a closer look at this topic, also because
the relevance and effect of such discoveries would rise depending on how well we assess the
performance of a system. Some academics suggest that at least the following four steps should
be defined within the assessment procedure to improve replicability and allow fair comparisons
across various works (either frameworks, research papers, or published artifacts) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: data
splitting, item recommendation, candidate item generation, and performance measurement, all
operating on the same data. These phases were complemented with dataset collection and statistical testing in
a recent study [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Depending on the performance dimension to examine, several of these phases
can be further classified, e.g., performance measurement. Gunawardana and Shani [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] review
different performance characteristics of RSs, comparing some measures, e.g., accuracy, coverage,
confidence, trust, novelty, variety, and serendipity. However, to the best of our knowledge, no
public implementation that provides more than one or two of these aspects exists. Furthermore,
other dimensions such as bias (in particular, popularity bias [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and fairness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have lately
been explored by the community [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        Reproducibility is the keystone of modern RSs research. Dacrema et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Rendle et al.
[10] have recently raised the need for comprehensive and fair recommender model evaluation.
However, the outstanding success and the community interests in Deep Learning (DL)
recommendation models raised the need for novel instruments. LibRec [11], Spotlight [12], and
OpenRec [13] were the first open-source projects that made DL-based recommenders available
– albeit with fewer than a dozen models and without filtering, splitting, and hyperparameter
tuning strategies. Thus, they do not provide a general tool for extensive experiments on the
preprocessing and the evaluation of a dataset. Indeed, after the reproducibility hype [
        <xref ref-type="bibr" rid="ref9">9, 10</xref>
        ],
DaisyRec [14] and RecBole [15] raised the bar for framework capabilities, making available
large sets of models, data filtering/splitting strategies and, above all, hyperparameter tuning features.
      </p>
      <p>From the researcher’s point of view, our framework solves the issues mentioned above.
ELLIOT [16] natively provides widespread research evaluation features, such as the analysis of
multiple cut-offs and several RSs. ELLIOT supplies, to date, 36 metrics, 13 splitting strategies,
and 8 prefiltering policies to evaluate diverse tasks and domains. Moreover, the framework
offers, to date, 27 similarities and 51 combined hyperparameter tuning approaches.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Elliot</title>
      <p>ELLIOT (Figure 1) is an extendable framework with eight functional modules, each of which is
in charge of a different aspect of the experimental recommendation process. The user is only meant</p>
      <p>[Figure 1: Overview of the ELLIOT architecture. Data modules: Loading (ratings, side information), Prefiltering (filter-by-rating, k-core), Splitting (temporal, random, fix); Run module (train, restore model, external model); Evaluation modules: Metrics (accuracy, error, coverage, novelty, diversity, bias, fairness) and Statistical Tests (paired t-test, Wilcoxon); Output module (performance tables, model weights, recommendation lists); plus optional modules, all driven by the configuration file.]</p>
      <p>
to input human-level experimental flow information via a configuration file, so
that what happens behind the scenes is transparent. Hence, ELLIOT allows the execution of the
whole pipeline. In the following, we detail each module and how to create a configuration file.</p>
      <sec id="sec-2-1">
        <title>2.1. Data Preparation</title>
        <p>Loading. Different data sources, such as user-item feedback and extra side information, may be
required for RSs experiments. Hence, ELLIOT has a variety of Loading module implementations.
The researcher/experiment designer may create custom prefiltering and splitting methods that
can be saved and loaded to save time in the future. Additional data, such as visual [17, 18] and
semantic features [19, 20, 21, 22], can be handled through a specific data loader.
Prefiltering. ELLIOT offers data filtering options based on two different techniques. The first
is Filter-by-rating, whose purpose is to eliminate user-item interactions if the preference score
is below a certain threshold. The threshold can be (i) a Numerical value, e.g., 3.5, (ii) a Distributional detail, e.g.,
the global rating average value, or (iii) a user-based distributional (User Dist.) value, e.g., the user’s
average rating value. The second, k-core, filters out users, items, or both, with fewer than k
recorded interactions. It can proceed iteratively (Iterative k-core) on both users and items until
the filtering condition is met, i.e., all the users and items have at least k recorded interactions.
Finally, the Cold-Users filtering allows retaining cold users only.</p>
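        <p>The iterative k-core strategy described above can be sketched as follows; this is an illustrative re-implementation for clarity, not ELLIOT’s actual code, and the function name is ours.</p>
        <preformat>
```python
from collections import Counter

def iterative_k_core(interactions, k):
    """Iteratively drop users and items with fewer than k interactions,
    until every remaining user and item has at least k of them."""
    data = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in data)
        item_counts = Counter(i for _, i in data)
        kept = [(u, i) for u, i in data
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(data):  # fixed point: condition met everywhere
            return kept
        data = kept

ratings = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i3")]
print(iterative_k_core(ratings, 2))
# → [('u1', 'i1'), ('u1', 'i2'), ('u2', 'i1'), ('u2', 'i2')]
```
        </preformat>
        <p>The iteration is needed because dropping an item can push one of its users below the k threshold, and vice versa; the loop stops only when no further interaction is removed.</p>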
        <p>Splitting. ELLIOT implements three splitting strategies: (i) Temporal, (ii) Random, and (iii) Fix.
The Temporal method divides user-item interactions depending on the transaction timestamp,
either by setting the timestamp, selecting the best one [23, 24], or using a hold-out (HO)
mechanism. Hold-out (HO), K-repeated hold-out (K-HO), and cross-validation (CV) are all part
of the Random methods. Finally, the Fix approach leverages an already split dataset.
Recommendation Models. The Recommendation module provides the functionalities to train
(and restore) the ELLIOT recommendation models and the new ones integrated by users. To
date, ELLIOT integrates around 50 recommendation models partitioned into two sets: (i) 38
popular models implemented in at least two of the other reviewed frameworks, and (ii) other
well-known state-of-the-art recommendation models implemented in fewer than two frameworks, like
MultiDAE [25], graph-learning approaches, e.g., NGCF [26], visual ones [27], e.g., VBPR [28], adversarial-robust ones,
e.g., AMR [29] and MSAPMF [30], and content-aware ones, e.g., KaHFM [19] and KGFlex [31].
Hyper-parameter Tuning. According to Rendle et al. [10] and Anelli et al. [32],
hyperparameter optimization has a significant impact on performance. Grid Search, Simulated Annealing,
Bayesian Optimization, and Random Search are all offered by ELLIOT. Additionally, it supports
four different traversal techniques in the search space. Grid Search is automatically inferred
when the user specifies the available hyperparameters.</p>
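        <p>As an illustration of the hold-out idea behind the Temporal strategy, a per-user temporal hold-out can be sketched as follows; this is a simplified sketch, not ELLIOT’s implementation, and the test_ratio parameter is our assumption.</p>
        <preformat>
```python
from collections import defaultdict

def temporal_holdout(interactions, test_ratio=0.2):
    """Per-user temporal hold-out: the most recent fraction of each user's
    transactions goes to the test set, the rest to the training set."""
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))
    train, test = [], []
    for user, events in by_user.items():
        events.sort()  # chronological order per user
        n_test = max(1, int(len(events) * test_ratio))
        train += [(user, item, ts) for ts, item in events[:-n_test]]
        test += [(user, item, ts) for ts, item in events[-n_test:]]
    return train, test
```
        </preformat>
        <p>A random hold-out differs only in that each user’s transactions would be shuffled instead of sorted by timestamp.</p>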
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Performance Evaluation</title>
        <p>Metrics. ELLIOT provides a set of 36 evaluation metrics, partitioned into seven families:
Accuracy [33, 34], Error, Coverage, Novelty [35], Diversity [36], Bias [37, 38, 39, 40, 41], and
Fairness [42, 43]. It is worth mentioning that ELLIOT exposes the largest number of metrics
among the reviewed frameworks and is the only one considering bias and fairness measures. Moreover,
the practitioner can choose any metric to drive the model selection and the tuning.
Statistical Tests. None of the other cited frameworks supports statistical hypothesis tests,
probably due to the need for computing fine-grained (e.g., per-user or per-partition) results and
retaining them for each recommendation model. Conversely, ELLIOT computes two
statistical hypothesis tests, i.e., the Wilcoxon and the paired t-test, with a flag in the configuration file.</p>
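        <p>As an illustration, the paired t-test over per-user results can be computed as follows; the sketch and the toy numbers are ours, not ELLIOT’s code.</p>
        <preformat>
```python
import math
from statistics import mean, stdev

def paired_t_statistic(model_a, model_b):
    """Paired t statistic over per-user metric values (e.g., nDCG) of two
    models evaluated on the same users; compare it against a t distribution
    with n - 1 degrees of freedom to obtain a p-value."""
    diffs = [a - b for a, b in zip(model_a, model_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Toy per-user nDCG values for two recommenders on the same test users.
ndcg_a = [0.61, 0.45, 0.72, 0.38]
ndcg_b = [0.51, 0.40, 0.62, 0.36]
print(paired_t_statistic(ndcg_a, ndcg_b))  # ≈ 3.42
```
        </preformat>
        <p>This is why fine-grained per-user results must be retained: both tests pair the two models’ scores user by user rather than comparing aggregate averages.</p>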
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Framework Outcomes</title>
        <p>When the training of recommenders is over, ELLIOT uses the Output module to gather the
results. Three types of output files can be generated: (i) Performance Tables, (ii) Model Weights,
and (iii) Recommendation Lists. Performance Tables come in the form of spreadsheets, including
all the metric values generated on the test set for each recommendation model given in the
configuration file. Cut-off-specific and model-specific tables are included in a final report (i.e.,
considering each combination of the explored parameters). Statistical hypothesis tests are also
presented in the tables, as well as a JSON file that summarizes the optimal model parameters.
Optionally, ELLIOT stores the model weights for the sake of future re-training.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Preparation of the Experiment</title>
        <p>ELLIOT is triggered by a single configuration file written in YAML (e.g., refer to the toy
example sample_hello_world.yml). The first section details the data loading, filtering, and
splitting information defined in Section 2.1. The models section represents the recommendation
models’ configuration, e.g., Item-kNN. Here, the model-specific hyperparameter optimization
strategies are specified, e.g., the grid search. The evaluation section details the evaluation
strategy with the desired metrics, e.g., nDCG in the toy example. Finally, the save_recs and top_k
keys detail, for example, the Output module abilities described in Section 2.3.</p>
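        <p>A minimal configuration in that spirit might look as follows; the key names mirror the modules described above, but this is an illustrative sketch (the dataset path and all values are placeholders), and the authoritative schema is the one shipped with the framework.</p>
        <preformat>
```yaml
experiment:
  data_config:
    strategy: dataset
    dataset_path: ./data/movielens/dataset.tsv   # placeholder path
  prefiltering:
    strategy: iterative_k_core
    core: 5
  splitting:
    test_splitting:
      strategy: random_subsampling
      test_ratio: 0.2
  top_k: 10
  evaluation:
    simple_metrics: [nDCG]
  models:
    ItemKNN:
      meta:
        save_recs: True
      neighbors: [50, 100]   # a list of values implies a grid search
      similarity: cosine
```
        </preformat>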
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion and Future Work</title>
      <p>ELLIOT is a framework that performs the entire recommendation process from an RS researcher’s
perspective. It only requires the practitioner/researcher to write a configuration file to conduct a
rigorous and reproducible experimental evaluation. The framework provides several
functionalities: loading, prefiltering, splitting, hyperparameter optimization strategies, recommendation
models, and statistical hypothesis tests. To the best of our knowledge, ELLIOT is the first
recommendation framework providing an entire multi-recommender experimental pipeline
based on a simple configuration file. We plan to extend ELLIOT in various directions to include:
sequential recommendation scenarios, adversarial attacks, reinforcement learning-based
recommendation systems, differential privacy facilities, sampled evaluation, and federated/distributed
recommendation.
[9] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: T. Bogers, A. Said, P. Brusilovsky, D. Tikk (Eds.), Proceedings of the 13th ACM Conference on Recommender
Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019, ACM, 2019, pp.
101–109. URL: https://doi.org/10.1145/3298689.3347058. doi:10.1145/3298689.3347058.
[10] S. Rendle, W. Krichene, L. Zhang, J. R. Anderson, Neural collaborative filtering vs. matrix
factorization revisited, in: R. L. T. Santos, L. B. Marinho, E. M. Daly, L. Chen, K. Falk,
N. Koenigstein, E. S. de Moura (Eds.), RecSys 2020: Fourteenth ACM Conference on
Recommender Systems, Virtual Event, Brazil, September 22-26, 2020, ACM, 2020, pp.
240–248. URL: https://doi.org/10.1145/3383313.3412488. doi:10.1145/3383313.3412488.
[11] G. Guo, J. Zhang, Z. Sun, N. Yorke-Smith, Librec: A java library for recommender systems,
in: A. I. Cristea, J. Masthoff, A. Said, N. Tintarev (Eds.), Posters, Demos, Late-breaking
Results and Workshop Proceedings of the 23rd Conference on User Modeling, Adaptation,
and Personalization (UMAP 2015), Dublin, Ireland, June 29 - July 3, 2015, volume 1388
of CEUR Workshop Proceedings, CEUR-WS.org, 2015. URL: http://ceur-ws.org/Vol-1388/
demo_paper1.pdf.
[12] M. Kula, Spotlight, https://github.com/maciejkula/spotlight, 2017.
[13] L. Yang, E. Bagdasaryan, J. Gruenstein, C. Hsieh, D. Estrin, Openrec: A modular framework
for extensible and adaptable recommendation algorithms, in: Y. Chang, C. Zhai, Y. Liu,
Y. Maarek (Eds.), Proceedings of the Eleventh ACM International Conference on Web Search
and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, ACM, 2018,
pp. 664–672. URL: https://doi.org/10.1145/3159652.3159681. doi:10.1145/3159652.3159681.
[14] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously?
benchmarking recommendation for reproducible evaluation and fair comparison, in:
R. L. T. Santos, L. B. Marinho, E. M. Daly, L. Chen, K. Falk, N. Koenigstein, E. S. de Moura
(Eds.), RecSys 2020: Fourteenth ACM Conference on Recommender Systems, Virtual Event,
Brazil, September 22-26, 2020, ACM, 2020, pp. 23–32. URL: https://doi.org/10.1145/3383313.3412489. doi:10.1145/3383313.3412489.
[15] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, K. Li, Y. Chen, Y. Lu, H. Wang, C. Tian, X. Pan, Y. Min,
Z. Feng, X. Fan, X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, J. Wen, Recbole: Towards a
unified, comprehensive and efficient framework for recommendation algorithms, CoRR
abs/2011.01731 (2020). URL: https://arxiv.org/abs/2011.01731. arXiv:2011.01731.
[16] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. D.</p>
      <p>Noia, Elliot: A comprehensive and rigorous framework for reproducible recommender
systems evaluation, in: SIGIR, ACM, 2021, pp. 2405–2414.
[17] W. Kang, C. Fang, Z. Wang, J. J. McAuley, Visually-aware fashion recommendation and
design with generative image models, in: V. Raghavan, S. Aluru, G. Karypis, L. Miele,
X. Wu (Eds.), 2017 IEEE International Conference on Data Mining, ICDM 2017, New
Orleans, LA, USA, November 18-21, 2017, IEEE Computer Society, 2017, pp. 207–216. URL:
https://doi.org/10.1109/ICDM.2017.30. doi:10.1109/ICDM.2017.30.
[18] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, T. Chua, Attentive collaborative filtering:
Multimedia recommendation with item- and component-level attention, in: N. Kando,
T. Sakai, H. Joho, H. Li, A. P. de Vries, R. W. White (Eds.), Proceedings of the 40th
International ACM SIGIR Conference on Research and Development in Information
Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, ACM, 2017, pp. 335–344. URL:
https://doi.org/10.1145/3077136.3080797. doi:10.1145/3077136.3080797.
[19] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ragone, J. Trotta, How to make latent factors
interpretable by feeding factorization machines with knowledge graphs, in: C.
Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. F. Cruz, A. Hogan, J. Song, M. Lefrançois,
F. Gandon (Eds.), The Semantic Web - ISWC 2019 - 18th International Semantic Web
Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part I,
volume 11778 of Lecture Notes in Computer Science, Springer, 2019, pp. 38–56. URL: https://doi.org/10.1007/978-3-030-30793-6_3. doi:10.1007/978-3-030-30793-6_3.
[20] V. W. Anelli, A. Calì, T. D. Noia, M. Palmonari, A. Ragone, Exposing open street map in the
linked data cloud, in: IEA/AIE, volume 9799 of Lecture Notes in Computer Science, Springer,
2016, pp. 344–355.
[21] V. W. Anelli, T. D. Noia, P. Lops, E. D. Sciascio, Feature factorization for top-n
recommendation: From item rating to features relevance, in: RecSysKTL, volume 1887 of CEUR
Workshop Proceedings, CEUR-WS.org, 2017, pp. 16–21.
[22] V. W. Anelli, P. Basile, D. G. Bridge, T. D. Noia, P. Lops, C. Musto, F. Narducci, M. Zanker,
Knowledge-aware and conversational recommender systems, in: RecSys, ACM, 2018, pp.
521–522.
[23] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ragone, J. Trotta, Local popularity and time in
top-n recommendation, in: L. Azzopardi, B. Stein, N. Fuhr, P. Mayr, C. Hauff, D. Hiemstra
(Eds.), Advances in Information Retrieval - 41st European Conference on IR Research,
ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part I, volume 11437 of
Lecture Notes in Computer Science, Springer, 2019, pp. 861–868. URL: https://doi.org/10.1007/978-3-030-15712-8_63. doi:10.1007/978-3-030-15712-8_63.
[24] A. Bellogín, P. Sánchez, Revisiting neighbourhood-based recommenders for temporal
scenarios, in: M. Bieliková, V. Bogina, T. Kuflik, R. Sasson (Eds.), Proceedings of the
1st Workshop on Temporal Reasoning in Recommender Systems co-located with 11th
International Conference on Recommender Systems (RecSys 2017), Como, Italy, August
27-31, 2017, volume 1922 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 40–44.</p>
      <p>URL: http://ceur-ws.org/Vol-1922/paper8.pdf.
[25] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for
collaborative filtering, in: P. Champin, F. L. Gandon, M. Lalmas, P. G. Ipeirotis (Eds.), Proceedings
of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France,
April 23-27, 2018, ACM, 2018, pp. 689–698. URL: https://doi.org/10.1145/3178876.3186150. doi:10.1145/3178876.3186150.
[26] X. Wang, X. He, M. Wang, F. Feng, T. Chua, Neural graph collaborative filtering, in:
B. Piwowarski, M. Chevalier, É. Gaussier, Y. Maarek, J. Nie, F. Scholer (Eds.), Proceedings
of the 42nd International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 165–174. URL:
https://doi.org/10.1145/3331184.3331267. doi:10.1145/3331184.3331267.
[27] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. D.</p>
      <p>Noia, V-elliot: Design, evaluate and tune visual recommender systems, in: RecSys 2021:
Fifteenth ACM Conference on Recommender Systems (RecSys ’21), September 27-October 1,
2021, Amsterdam, Netherlands, ACM, 2021. URL: https://doi.org/10.1145/3460231.3478881.
doi:10.1145/3460231.3478881.
[28] R. He, J. J. McAuley, VBPR: visual bayesian personalized ranking from implicit feedback,
in: D. Schuurmans, M. P. Wellman (Eds.), Proceedings of the Thirtieth AAAI Conference
on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, AAAI Press, 2016,
pp. 144–150. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11914.
[29] J. Tang, X. Du, X. He, F. Yuan, Q. Tian, T. Chua, Adversarial training towards robust
multimedia recommender system, IEEE Trans. Knowl. Data Eng. 32 (2020) 855–867. URL:
https://doi.org/10.1109/TKDE.2019.2893638. doi:10.1109/TKDE.2019.2893638.
[30] V. W. Anelli, A. Bellogín, Y. Deldjoo, T. Di Noia, F. A. Merra, Msap: Multi-step adversarial
perturbations on recommender systems embeddings, The International FLAIRS Conference
Proceedings 34 (2021). doi:10.32473/flairs.v34i1.128443.
[31] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ferrara, A. C. M. Mancino, Sparse feature
factorization for recommender systems with knowledge graphs, in: RecSys 2021: Fifteenth
ACM Conference on Recommender Systems (RecSys ’21), September 27-October 1, 2021,
Amsterdam, Netherlands, ACM, 2021. URL: https://doi.org/10.1145/3460231.3474243. doi:10.1145/3460231.3474243.
[32] V. W. Anelli, T. D. Noia, E. D. Sciascio, C. Pomo, A. Ragone, On the discriminative power
of hyper-parameters in cross-validation and how to choose them, in: T. Bogers, A. Said,
P. Brusilovsky, D. Tikk (Eds.), Proceedings of the 13th ACM Conference on Recommender
Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019, ACM, 2019, pp.
447–451. URL: https://doi.org/10.1145/3298689.3347010. doi:10.1145/3298689.3347010.
[33] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, K. Gai, Deep
interest network for click-through rate prediction, in: Y. Guo, F. Farooq (Eds.), Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data
Mining, KDD 2018, London, UK, August 19-23, 2018, ACM, 2018, pp. 1059–1068. URL:
https://doi.org/10.1145/3219819.3219823. doi:10.1145/3219819.3219823.
[34] G. Schröder, M. Thiele, W. Lehner, Setting goals and choosing metrics
for recommender system evaluations, volume 811, 2011, pp. 78–85. URL:
https://www.scopus.com/inward/record.uri?eid=2-s2.0-84891939277&amp;partnerID=
40&amp;md5=c5b68f245b2e03725e6e5acc1e3c6289.
[35] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender
systems, in: B. Mobasher, R. D. Burke, D. Jannach, G. Adomavicius (Eds.), Proceedings of the
2011 ACM Conference on Recommender Systems, RecSys 2011, Chicago, IL, USA, October
23-27, 2011, ACM, 2011, pp. 109–116. URL: https://dl.acm.org/citation.cfm?id=2043955.
[36] C. Zhai, W. W. Cohen, J. D. Lafferty, Beyond independent relevance: methods and
evaluation metrics for subtopic retrieval, in: C. L. A. Clarke, G. V. Cormack, J. Callan, D. Hawking,
A. F. Smeaton (Eds.), SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, July 28 - August 1,
2003, Toronto, Canada, ACM, 2003, pp. 10–17. URL: https://doi.org/10.1145/860435.860440.
doi:10.1145/860435.860440.
[37] H. Abdollahpouri, R. Burke, B. Mobasher, Managing popularity bias in recommender
systems with personalized re-ranking, in: R. Barták, K. W. Brawner (Eds.), Proceedings of
the Thirty-Second International Florida Artificial Intelligence Research Society Conference,
Sarasota, Florida, USA, May 19-22 2019, AAAI Press, 2019, pp. 413–418. URL: https://aaai.
org/ocs/index.php/FLAIRS/FLAIRS19/paper/view/18199.
[38] H. Abdollahpouri, R. Burke, B. Mobasher, Controlling popularity bias in learning-to-rank
recommendation, in: P. Cremonesi, F. Ricci, S. Berkovsky, A. Tuzhilin (Eds.), Proceedings
of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy,
August 27-31, 2017, ACM, 2017, pp. 42–46. URL: https://doi.org/10.1145/3109859.3109912.
doi:10.1145/3109859.3109912.
[39] H. Yin, B. Cui, J. Li, J. Yao, C. Chen, Challenging the long tail recommendation, Proc. VLDB
Endow. 5 (2012) 896–907. URL: http://vldb.org/pvldb/vol5/p896_hongzhiyin_vldb2012.pdf.
doi:10.14778/2311906.2311916.
[40] Z. Zhu, J. Wang, J. Caverlee, Measuring and mitigating item under-recommendation bias
in personalized ranking systems, in: J. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock,
J. Wen, Y. Liu (Eds.), Proceedings of the 43rd International ACM SIGIR conference on
research and development in Information Retrieval, SIGIR 2020, Virtual Event, China,
July 25-30, 2020, ACM, 2020, pp. 449–458. URL: https://doi.org/10.1145/3397271.3401177.
doi:10.1145/3397271.3401177.
[41] V. Tsintzou, E. Pitoura, P. Tsaparas, Bias disparity in recommendation systems, in:
R. Burke, H. Abdollahpouri, E. C. Malthouse, K. P. Thai, Y. Zhang (Eds.), Proceedings of the
Workshop on Recommendation in Multi-stakeholder Environments co-located with the
13th ACM Conference on Recommender Systems (RecSys 2019), Copenhagen, Denmark,
September 20, 2019, volume 2440 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL:
http://ceur-ws.org/Vol-2440/short4.pdf.
[42] Y. Deldjoo, V. W. Anelli, H. Zamani, A. Bellogin, T. Di Noia, A flexible framework for
evaluating user and item fairness in recommender systems, User Modeling and
User-Adapted Interaction (2020) 1–47.
[43] Z. Zhu, X. Hu, J. Caverlee, Fairness-aware tensor-based recommendation, in: A. Cuzzocrea,
J. Allan, N. W. Paton, D. Srivastava, R. Agrawal, A. Z. Broder, M. J. Zaki, K. S. Candan,
A. Labrinidis, A. Schuster, H. Wang (Eds.), Proceedings of the 27th ACM International
Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October
22-26, 2018, ACM, 2018, pp. 1153–1162. URL: https://doi.org/10.1145/3269206.3271795.
doi:10.1145/3269206.3271795.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , G. Adomavicius,
          <article-title>Toward identification and adoption of best practices in algorithmic recommender systems research</article-title>
          , in: A.
          <string-name>
            <surname>Bellogín</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Castells</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Said</surname>
          </string-name>
          , D. Tikk (Eds.),
          <source>Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation, RepSys</source>
          <year>2013</year>
          ,
          Hong Kong
          , China, October
          <volume>12</volume>
          ,
          <year>2013</year>
          , ACM,
          <year>2013</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          . URL: https://doi.org/10.1145/2532508.2532513.
          doi:10.1145/2532508.2532513.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>A.</given-names> <surname>Said</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bellogín</surname></string-name>,
          <article-title>Comparative recommender system evaluation: benchmarking recommendation frameworks</article-title>,
          in:
          <string-name><given-names>A.</given-names> <surname>Kobsa</surname></string-name>,
          <string-name><given-names>M. X.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ester</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Koren</surname></string-name>
          (Eds.),
          <source>Eighth ACM Conference on Recommender Systems, RecSys '14, Foster City, Silicon Valley, CA, USA, October 06-10, 2014</source>,
          ACM,
          <year>2014</year>,
          pp.
          <fpage>129</fpage>-<lpage>136</lpage>.
          URL: https://doi.org/10.1145/2645710.2645746. doi:10.1145/2645710.2645746.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>A.</given-names> <surname>Bellogín</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Said</surname></string-name>,
          <article-title>Improving accountability in recommender systems research through reproducibility</article-title>,
          <source>CoRR abs/2102.00482</source>
          (<year>2021</year>).
          URL: https://arxiv.org/abs/2102.00482. arXiv:2102.00482.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>A.</given-names> <surname>Gunawardana</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Shani</surname></string-name>,
          <article-title>Evaluating recommender systems</article-title>,
          in:
          <string-name><given-names>F.</given-names> <surname>Ricci</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Rokach</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Shapira</surname></string-name>
          (Eds.),
          <source>Recommender Systems Handbook</source>,
          Springer,
          <year>2015</year>,
          pp.
          <fpage>265</fpage>-<lpage>308</lpage>.
          URL: https://doi.org/10.1007/978-1-4899-7637-6_8. doi:10.1007/978-1-4899-7637-6_8.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>H.</given-names> <surname>Abdollahpouri</surname></string-name>,
          <article-title>Popularity bias in ranking and recommendation</article-title>,
          in:
          <string-name><given-names>V.</given-names> <surname>Conitzer</surname></string-name>,
          <string-name><given-names>G. K.</given-names> <surname>Hadfield</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Vallor</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA, January 27-28, 2019</source>,
          ACM,
          <year>2019</year>,
          pp.
          <fpage>529</fpage>-<lpage>530</lpage>.
          URL: https://doi.org/10.1145/3306618.3314309. doi:10.1145/3306618.3314309.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>M. D.</given-names> <surname>Ekstrand</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Burke</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Diaz</surname></string-name>,
          <article-title>Fairness and discrimination in retrieval and recommendation</article-title>,
          in:
          <string-name><given-names>B.</given-names> <surname>Piwowarski</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chevalier</surname></string-name>,
          <string-name><given-names>É.</given-names> <surname>Gaussier</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Maarek</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Nie</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Scholer</surname></string-name>
          (Eds.),
          <source>Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019</source>,
          ACM,
          <year>2019</year>,
          pp.
          <fpage>1403</fpage>-<lpage>1404</lpage>.
          URL: https://doi.org/10.1145/3331184.3331380. doi:10.1145/3331184.3331380.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>C.</given-names> <surname>Ardito</surname></string-name>,
          <string-name><given-names>T. D.</given-names> <surname>Noia</surname></string-name>,
          <string-name><given-names>E. D.</given-names> <surname>Sciascio</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Lofú</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Mallardi</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Pomo</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Vitulano</surname></string-name>,
          <article-title>Towards a trustworthy patient home-care thanks to an edge-node infrastructure</article-title>,
          in: HCSE, volume
          <volume>12481</volume>
          of <source>Lecture Notes in Computer Science</source>, Springer,
          <year>2020</year>,
          pp.
          <fpage>181</fpage>-<lpage>189</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>F. M.</given-names> <surname>Donini</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Narducci</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Pomo</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ragone</surname></string-name>,
          <article-title>Explanation in multi-stakeholder recommendation for enterprise decision support systems</article-title>,
          in: CAiSE Workshops, volume
          <volume>423</volume>
          of <source>Lecture Notes in Business Information Processing</source>, Springer,
          <year>2021</year>,
          pp.
          <fpage>39</fpage>-<lpage>47</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>M. F.</given-names> <surname>Dacrema</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Cremonesi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Jannach</surname></string-name>,
          <article-title>Are we really making much progress? A worrying analysis of recent neural recommendation approaches</article-title>,
          in:
          <string-name><given-names>T.</given-names> <surname>Bogers</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Said</surname></string-name>,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>