Replication of Collaborative Filtering Generative
Adversarial Networks on Recommender Systems
Discussion Paper

Fernando B. Pérez Maurera1,2,* , Maurizio Ferrari Dacrema1 and Paolo Cremonesi1
1 Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
2 ContentWise, Via Simone Schiaffino 11, 20158 Milano, Italy


Abstract
CFGAN and its family of models (TagRec, MTPR, and CRGAN) learn to generate personalized and
fake-but-realistic preferences for top-N recommendations by using previous interactions alone. This
work discusses the impact of certain differences between the CFGAN framework and the model used
in the original evaluation. The absence of random noise and the use of real user profiles as condition
vectors leave the generator prone to learning a degenerate solution in which the output vector is
identical to the input vector, essentially behaving as a simple auto-encoder. This work further expands
the experimental analysis by comparing CFGAN against a selection of simple, well-known, and properly
optimized baselines, observing that CFGAN is not consistently competitive against them despite its
high computational cost. This work is an extended abstract of the paper presented in [1].

Keywords
Generative Adversarial Networks, Recommender Systems, Collaborative Filtering, Replicability




1. Introduction
Evaluation studies of previous works are fundamental to validate previously claimed progress.
Several works have indicated the importance of such studies [2, 3, 4, 5, 6] for researchers and
practitioners. This work presents an evaluation study of the most notable generative model
applied in Recommender Systems: Collaborative Filtering GAN (CFGAN) [7].1
   CFGAN [7] is a recommendation model based on Generative Adversarial Networks (GANs). It
consists of two fully-connected feed-forward neural networks trained in an adversarial setting:
a generator and a discriminator. Figure 1 illustrates the adversarial training of CFGAN. The
generator learns to generate user profiles describing the preference of users toward items. The
discriminator learns to distinguish between real user profiles and those created by the generator.
Training of CFGAN converges when the generator creates fake but realistic user profiles. For a


IIR2022: 12th Italian Information Retrieval Workshop, June 29 - June 30th, 2022, Milan, Italy
* Corresponding author.
fernandobenjamin.perez@polimi.it (Fernando B. Pérez Maurera); maurizio.ferrari@polimi.it (M. Ferrari
Dacrema); paolo.cremonesi@polimi.it (P. Cremonesi)
ORCID: 0000-0001-6578-7404 (Fernando B. Pérez Maurera); 0000-0001-7103-2788 (M. Ferrari Dacrema);
0000-0002-1253-8081 (P. Cremonesi)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1 This work is an extended abstract of the work published in [1].
Figure 1: Training process of CFGAN. 𝐺, 𝐷, 𝑧 and 𝑐 are the generator network, discriminator network,
random noise, and condition vectors, respectively. Real profiles are not masked.


given user, CFGAN constructs their recommendations by selecting the top-N items with the
highest generated preference score.
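This ranking step can be sketched as follows; the function name, the toy profile, and the exclusion of seen items are illustrative assumptions, not part of the original implementation:

```python
import numpy as np

def recommend_top_n(generated_profile, seen_items, n):
    """Rank items by generated preference score, excluding already-seen items."""
    scores = generated_profile.astype(float).copy()
    scores[seen_items] = -np.inf  # never re-recommend interacted items
    # argpartition selects the n largest scores in O(num_items); only those are sorted
    top = np.argpartition(-scores, n)[:n]
    return top[np.argsort(-scores[top])]

# Toy generated profile over five items; item 0 was already interacted with.
profile = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
print(recommend_top_n(profile, seen_items=[0], n=2))  # [2 4]
```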
   This work presents and discusses the results of several experiments on CFGAN with the
goal of addressing two research objectives. First, to describe inconsistencies found between
the formulation of CFGAN and the implementation used in [7]. Second, to replicate the
claimed progress made in [7] by measuring the recommendation quality of CFGAN in a traditional
top-N recommendation scenario against properly-tuned baselines. The discussions presented here
are aligned with the exhortation given in [8]: research works should focus on understanding and
analyzing the proposed models.


2. Inconsistencies in CFGAN
Figure 1 presents the architecture and training process of CFGAN as described in the original
CFGAN paper [7]. From the figure, two vectors are part of the architecture of CFGAN: the random
noise and condition vectors, denoted as 𝑧 and 𝑐, respectively. This work highlights two inconsistencies
between the described framework and the implementation used in [7]: the use of real user profiles
as condition vectors and the absence of random noise in the empirical evaluation. These
inconsistencies raise concerns about the model’s ability to generalize and provide personalized
recommendations.
   First, the condition vector is used to provide personalized recommendations. To achieve
this, the vector is encoded with user features, e.g., location, social information, or identifiers,
among others. Due to the collaborative nature of the datasets used in the experiments of the
reference CFGAN [7], the user profiles are used as condition vectors, i.e., the very data points
that the generator and discriminator learn from. Using the user profiles as the condition vector
makes both networks prone to learning trivial solutions: the generator essentially becomes an
auto-encoder, and the discriminator may degenerate into learning a function that compares the
condition with the real or generated profile.
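This degenerate solution can be illustrated with a deliberately simplified sketch. The linear "generator" below is trained with a plain reconstruction objective on its own condition vector (an illustrative stand-in, not CFGAN's actual adversarial loss); it converges to the identity map, i.e., it learns to copy its input:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items = 50, 6
# Sparse binary user profiles, used both as training targets and as condition vectors.
profiles = (rng.random((num_users, num_items)) < 0.3).astype(float)

# Hypothetical linear generator G(c) = c @ W, conditioned on the profile itself.
W = rng.normal(scale=0.1, size=(num_items, num_items))
lr = 0.1
for _ in range(2000):
    grad = profiles.T @ (profiles @ W - profiles) / num_users  # MSE gradient
    W -= lr * grad

# W drifts toward the identity matrix: the generator degenerates into copying c.
print(np.round(W, 2))
```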
   Second, from a theoretical standpoint, random noise is required in traditional GANs to
explore several points of the input space and to learn a mapping from the noise space to the
data space. The random noise vector is also part of the CFGAN formulation, where it serves the
same purpose as in traditional GANs. However, the implementation of CFGAN in [7] removes the
random noise from the model. Due to this absence, CFGAN is trained on highly sparse user
profiles without any exploration of the input space. Furthermore, removing the random noise
implicitly assumes that user profiles are static over time. As a consequence, CFGAN is less
robust to evolving user preferences and dataset shifts [9].
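The difference in generator inputs can be sketched as follows (dimension sizes and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, noise_dim = 100, 32
user_profile = (rng.random(num_items) < 0.1).astype(np.float32)  # condition vector c

# Traditional conditional GAN: the generator input concatenates random noise z with
# the condition c, so repeated calls explore different points of the input space.
z = rng.normal(size=noise_dim).astype(np.float32)
gan_input = np.concatenate([z, user_profile])  # shape (132,)

# CFGAN as implemented in [7]: no noise, so the generator always sees the same
# static, highly sparse profile for a given user.
cfgan_input = user_profile  # shape (100,)
print(gan_input.shape, cfgan_input.shape)
```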


3. Experimental Methodology
This work presents an evaluation study comprising several experiments on CFGAN. The goal
of this evaluation study is two-fold. First, to replicate the progress claims made in [7], where
“replicability” is defined as in the ACM Artifact Review and Badging, version 1.1. 2 Second, to
measure the effects on recommendation quality caused by the inconsistencies between the
description of CFGAN and its implementation. The supplemental material provided in [7] solely
contains the implementation of CFGAN and its data splitting, training, and evaluation. The
details of the experimental methodology of the evaluation study are as follows:

Datasets and Splits: The experiments used the same open-source datasets and random holdout
splits as in [7], i.e., a sampled version of Ciao [10], and the ML100K and ML1M versions of
MovieLens [11]. A validation split was created for hyper-parameter tuning purposes following
the same split-creation steps as in [7].

Evaluation: All recommenders were evaluated on traditional accuracy and beyond-accuracy
metrics [2] in the standard top-N recommendation scenario. Hyper-parameters were searched
using Bayesian search with 16 initial random cases out of 50 total cases, optimizing NDCG [2].
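For illustration, a minimal binary-relevance NDCG@k can be computed as below; the actual experiments follow the evaluation framework of [2], of which this is only a sketch:

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_items)
    gains = np.array([1.0 if item in relevant else 0.0 for item in ranked_items[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))  # positions 1..k
    dcg = float(np.sum(gains * discounts))
    num_ideal = min(len(relevant), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, num_ideal + 2))))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 1, 7], relevant_items=[1, 7], k=3))  # ~0.693
```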

Baseline Recommenders: Neighborhood-based (ItemKNN and UserKNN) [2], graph-based
(RP3beta) [12], auto-encoders (SLIM ElasticNet [13] and EASE R [14]), and machine learning
recommenders (PureSVD [15] and MF BPR [16]). The description of these recommenders, their
hyper-parameters, and their ranges are found in [2].

CFGAN Recommenders: CFGAN as implemented in [7] was optimized. Two further variants
were trained using the optimal hyper-parameters of the former: CFGAN with random noise,
and CFGAN using user identifiers as condition vectors.3
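A plausible encoding for the user-identifier variant is a one-hot vector (an assumption made here for illustration): the condition then carries only the user's identity, not their interaction history.

```python
import numpy as np

def user_id_condition(user_id, num_users):
    """One-hot condition vector: identifies the user without exposing their profile."""
    c = np.zeros(num_users, dtype=np.float32)
    c[user_id] = 1.0
    return c

print(user_id_condition(2, num_users=5))  # [0. 0. 1. 0. 0.]
```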


4. Results and Discussion
Table 1 shows accuracy and beyond-accuracy metrics of baseline and CFGAN recommenders on
the ML1M dataset.4 Results on other datasets are consistent with this dataset unless otherwise
noted. From the table, it can be seen that at least two baselines have higher accuracy metrics
2
  Available online at https://www.acm.org/publications/policies/artifact-review-and-badging-current.
3
  Due to space limitations, this work omits the list of hyper-parameters of CFGAN.
4
  Due to space limitations, only a subset of accuracy and beyond-accuracy metrics are shown.
Table 1
Accuracy and beyond-accuracy metrics for tuned baselines and CFGAN on the ML1M dataset at
recommendation list length of 20. Higher accuracy values than CFGAN models reached by baselines
in bold. ItemKNN and UserKNN use asymmetric cosine. CFGAN uses early-stopping. CFGAN UI is
CFGAN with user identifiers as condition vector. CFGAN RN is CFGAN with random noise vector.
                              PRECISION     RECALL      MRR     NDCG     ITEM COVERAGE
           UserKNN              0.2891       0.2570    0.6595   0.3888       0.3286
           ItemKNN              0.2600       0.2196    0.6254   0.3490       0.2097
           RP3beta              0.2758       0.2385    0.6425   0.3700       0.3427
           PureSVD              0.2913       0.2421    0.6333   0.0516       0.2439
           SLIM ElasticNet      0.3119       0.2695    0.6724   0.4123       0.3153
           MF BPR               0.2485       0.2103    0.5753   0.3242       0.3126
           EASE R               0.3171       0.2763    0.6795   0.4192       0.3338
           CFGAN                0.2955       0.2473    0.6222   0.3799       0.2167
           CFGAN UI             0.1459       0.1118    0.3695   0.1831       0.0291
           CFGAN RN             0.2915       0.2425    0.6211   0.3760       0.2021


than CFGAN. In particular, UserKNN, SLIM ElasticNet, and EASE R reach a relatively higher
NDCG than CFGAN by 2.34 %, 8.53 %, and 10.34 %, respectively. Furthermore, these more
accurate baselines also trained faster than CFGAN, with training times lower by two to three
orders of magnitude. The results indicate that the progress claims made in [7] could not be
replicated in the experiments of this evaluation study.
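The relative differences follow directly from the NDCG column of Table 1:

```python
# Relative NDCG difference of each baseline over CFGAN (Table 1, ML1M, list length 20).
ndcg = {"CFGAN": 0.3799, "UserKNN": 0.3888, "SLIM ElasticNet": 0.4123, "EASE R": 0.4192}
for name in ("UserKNN", "SLIM ElasticNet", "EASE R"):
    rel = 100.0 * (ndcg[name] - ndcg["CFGAN"]) / ndcg["CFGAN"]
    print(f"{name}: +{rel:.2f} %")  # +2.34 %, +8.53 %, +10.34 %
```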
   Regarding the absence of random noise, the results of the experiments are mixed. Across
datasets and variants, adding random noise to CFGAN (CFGAN RN in Table 1) led to both
relative increases and decreases in accuracy, without a clear pattern.
   Clear patterns emerged when changing the condition vector from user profiles to user
identifiers (CFGAN UI in Table 1). In particular, across datasets and variants, CFGAN UI
consistently obtained lower accuracy metrics with respect to the base CFGAN. These results pose
the following dichotomy. On the one hand, using user profiles as condition vectors may lead
both networks to learn a trivial solution, as discussed in Section 2. On the other hand, using
user identifiers as condition vectors when learning from pure collaborative data is possible in
CFGAN, but at the cost of recommendation accuracy.
   Further studies are needed to address the recommendation quality of CFGAN and the
inconsistencies presented in this work. For instance, a revision of the architecture of CFGAN
could be addressed in future works, since the results of this work suggest that the current
architecture does not work when the condition vectors are changed to user identifiers. All these
aspects are still open research questions, and addressing them will be beneficial for the maturity
of this recommendation model.
References
 [1] F. B. Pérez Maurera, M. Ferrari Dacrema, P. Cremonesi, An evaluation study of generative
     adversarial networks for collaborative filtering, in: Advances in Information Retrieval -
     44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14,
     2022, Proceedings, Part I, volume 13185 of Lecture Notes in Computer Science, Springer,
     2022, pp. 671–685. doi:10.1007/978-3-030-99736-6\_45.
 [2] M. Ferrari Dacrema, S. Boglio, P. Cremonesi, D. Jannach, A troubling analysis of repro-
     ducibility and progress in recommender systems research, ACM Trans. Inf. Syst. 39 (2021)
     20:1–20:49. doi:10.1145/3434185.
 [3] M. Ferrari Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A
     worrying analysis of recent neural recommendation approaches, in: Proceedings of the
     13th ACM Conference on Recommender Systems, RecSys 2019, Copenhagen, Denmark,
     September 16-20, 2019, ACM, 2019, pp. 101–109. doi:10.1145/3298689.3347058.
 [4] J. Lin, The neural hype and comparisons against weak baselines, SIGIR Forum 52 (2019)
     40–51. doi:10.1145/3308774.3308781.
 [5] J. Lin, The neural hype, justified! a recantation, SIGIR Forum 53 (2021) 88–93. doi:10.
     1145/3458553.3458563.
 [6] W. Yang, K. Lu, P. Yang, J. Lin, Critically examining the "neural hype": Weak baselines and
     the additivity of effectiveness gains from neural ranking models, in: Proceedings of the
     42nd International ACM SIGIR Conference on Research and Development in Information
     Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 1129–1132. doi:10.
     1145/3331184.3331340.
 [7] D. Chae, J. Kang, S. Kim, J. Lee, CFGAN: A generic collaborative filtering framework
     based on generative adversarial networks, in: Proceedings of the 27th ACM International
     Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October
     22-26, 2018, ACM, 2018, pp. 137–146. doi:10.1145/3269206.3271743.
 [8] Z. C. Lipton, J. Steinhardt, Troubling trends in machine learning scholarship, ACM Queue
     17 (2019) 80. doi:10.1145/3317287.3328534.
 [9] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence, Dataset Shift in
      Machine Learning, The MIT Press, 2009.
[10] J. Tang, H. Gao, H. Liu, mTrust: discerning multi-faceted trust in a connected world,
     in: Proceedings of the Fifth International Conference on Web Search and Web Data
     Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012, ACM, 2012, pp. 93–102.
     doi:10.1145/2124295.2124309.
[11] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans.
     Interact. Intell. Syst. 5 (2016) 19:1–19:19. doi:10.1145/2827872.
[12] F. Christoffel, B. Paudel, C. Newell, A. Bernstein, Blockbusters and wallflowers: Accurate,
     diverse, and scalable recommendations with random walks, in: Proceedings of the 9th
     ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, September
     16-20, 2015, ACM, 2015, pp. 163–170. doi:10.1145/2792838.2800180.
[13] X. Ning, G. Karypis, SLIM: sparse linear methods for top-n recommender systems, in:
     11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada,
     December 11-14, 2011, IEEE Computer Society, 2011, pp. 497–506. doi:10.1109/ICDM.
     2011.134.
[14] H. Steck, Embarrassingly shallow autoencoders for sparse data, in: The World Wide
     Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp.
     3251–3257. doi:10.1145/3308558.3313710.
[15] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n
     recommendation tasks, in: Proceedings of the 2010 ACM Conference on Recommender
     Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, ACM, 2010, pp. 39–46.
     doi:10.1145/1864708.1864721.
[16] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: bayesian personalized
     ranking from implicit feedback, in: UAI 2009, Proceedings of the Twenty-Fifth Conference
     on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, AUAI
     Press, 2009, pp. 452–461.