=Paper=
{{Paper
|id=Vol-3177/paper20
|storemode=property
|title=Replication of Recommender Systems with Impressions
|pdfUrl=https://ceur-ws.org/Vol-3177/paper20.pdf
|volume=Vol-3177
|authors=Fernando Benjamín Pérez Maurera,Maurizio Ferrari Dacrema,Paolo Cremonesi
|dblpUrl=https://dblp.org/rec/conf/iir/MaureraDC22a
}}
==Replication of Recommender Systems with Impressions==
Replication of Recommender Systems with Impressions
Discussion Paper

Fernando B. Pérez Maurera¹,²,*, Maurizio Ferrari Dacrema¹ and Paolo Cremonesi¹
¹ Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
² ContentWise, Via Simone Schiaffino 11, 20158 Milano, Italy

Abstract: Impressions are a novel data type in Recommender Systems containing the previously-exposed items, i.e., what was shown on-screen. Due to their novelty, the current literature lacks a characterization of impressions and replications of previous experiments. Moreover, previous research works have mainly used impressions in industrial contexts or recommender systems competitions, such as the ACM RecSys Challenges. This work is part of an ongoing study about impressions in recommender systems. It presents an evaluation of impressions recommenders on current open datasets, comparing not only the recommendation quality of impressions recommenders against strong baselines, but also determining whether previous progress claims can be replicated.

Keywords: Recommender Systems, Impressions, Exposure, Replication, Collaborative Filtering

1. Introduction

A recurrent and fundamental task in Recommender Systems (RS) is the empirical evaluation of recommendation models with varied data sources. One particularly novel and modestly explored data source in RS research are impressions. These contain not only the previous interactions (e.g., purchases and clicks) of users but also the items the users were presented with (e.g., recommendations and search results). Previous research works [1, 2, 3, 4] have proposed recommendation models that leverage impressions data, called impressions recommenders. To date, no previous work has tried to replicate these models on open datasets.¹

The replication of previous works is fundamental to measure the current status of recommendation models across different domains and data sources. Previous research works have highlighted the importance of replication studies for the RS community [5, 6, 7, 8, 9]. To address this gap in the literature, this work presents a replication study of four impressions recommenders.² First, this work presents a brief categorization of impressions data, as the current literature does not have one. Second, this work empirically evaluates the recommendation quality of several baseline and impressions recommenders on current open-source impressions datasets and compares the obtained results with the claims given in the original works.

IIR2022: 12th Italian Information Retrieval Workshop, June 29-30, 2022, Milan, Italy
* Corresponding author.
Emails: fernandobenjamin.perez@polimi.it (F. B. Pérez Maurera); maurizio.ferrari@polimi.it (M. Ferrari Dacrema); paolo.cremonesi@polimi.it (P. Cremonesi)
ORCID: 0000-0001-6578-7404 (F. B. Pérez Maurera); 0000-0001-7103-2788 (M. Ferrari Dacrema); 0000-0002-1253-8081 (P. Cremonesi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

¹ The concept of "replicability" is the same as in the ACM Artifact Review and Badging, version 1.1, available online at https://www.acm.org/publications/policies/artifact-review-and-badging-current.
² This work is part of an ongoing study about impressions in recommender systems.
2. Impressions in Recommender Systems

Impressions are a novel and modestly used data source that contains the items shown on-screen to users, e.g., the items that users were presented with when browsing an e-commerce service. Similar to interactions data in RS, an impression is characterized as a user-item pair (u, i), indicating that user u has been impressed with item i (a minimal data sketch is given at the end of this section). Importantly, previous research with impressions has been carried out in the context of industrial settings or RS competitions. Hence, progress in impressions research has been mostly slow. The following presents a brief categorization of impressions:

Signals: The signals within impressions are mixed, i.e., impressions may reflect both positive and negative user preferences toward items, mostly depending on the provenance of the impressions, e.g., a recommender system or business rules. There is no consensus in the current literature regarding the meaning of impressions. For instance, in the same context, previous research works have used impressions as positive [10] or negative [11] signals.

Challenges: Three main considerations should be taken into account when working with impressions data. First, the heterogeneous signals within impressions. Second, scalability, as the number of impressions records might be orders of magnitude greater than the number of interactions. Third, the effects of feedback loops between users and recommendation systems.

Impressions Recommenders: Two types of impressions recommenders have been proposed in previous research works: re-ranking and impressions as user profiles recommenders. The first group re-scores the preference scores of an existing recommendation model based on impressions data and features extracted from impressions [3, 1, 12, 13]. The second group expands the user profiles (interactions) with impressions data [14]. A sketch of both strategies is given at the end of this section.

Impressions Datasets: Three datasets from different recommendation domains are open-source and can be used in research activities: ContentWise Impressions (TV and movies), MIND (news), and FINN.no Slates (e-commerce). Private and non-distributable datasets also exist and have been used in previous works [1, 12, 13, 15, 16]. However, due to their nature or license agreements, it is not possible to use them in newer research works.

Evaluation of Impressions: No evaluation and comparison of impressions recommenders on open datasets exists in the current literature. Currently, research works with impressions have been carried out in two contexts: recommendation challenges [14, 17, 18, 11] or industrial scenarios [13, 1, 19, 12]. In the former, complex recommendation models are built and tested against a specific dataset without assessing the generalization of impressions to other areas or domains. In the latter, impressions are studied on private data and recommendation systems. No previous work has performed ablation studies to assess the impact of impressions.
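To make the (u, i) characterization concrete, the following is a minimal sketch in Python (NumPy/SciPy) of how interactions and impressions can be stored as two user-item sparse matrices of the same shape. The matrix names URM and UIM and the toy pairs are illustrative assumptions, not taken from the datasets above.

```python
import numpy as np
import scipy.sparse as sps

# Toy data: (u, i) pairs. Interactions are clicks/purchases; impressions
# are the items shown on-screen and may repeat for the same pair.
n_users, n_items = 4, 6
interactions = [(0, 1), (1, 2), (2, 0), (3, 4)]
impressions = [(0, 1), (0, 3), (1, 2), (1, 5), (2, 0),
               (2, 1), (3, 4), (3, 4), (3, 2)]

def to_csr(pairs, shape):
    """Build a user-item matrix; duplicate pairs sum into frequencies."""
    rows, cols = zip(*pairs)
    data = np.ones(len(pairs))
    return sps.csr_matrix((data, (rows, cols)), shape=shape)

URM = to_csr(interactions, (n_users, n_items))  # user rating matrix
UIM = to_csr(impressions, (n_users, n_items))   # user impressions matrix
# UIM[u, i] is the number of times user u was impressed with item i.
print(UIM.toarray())
```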
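Likewise, a minimal sketch of the two families of impressions recommenders described above. The discounting function and the profile weight are deliberate simplifications for illustration, not the exact formulations of [1] or [14].

```python
import numpy as np

def rerank_with_discounting(scores, impression_freq, alpha=0.5):
    # Re-ranking family (cf. Impressions Discounting [1], simplified):
    # dampen the score of items the user has already been shown often.
    return scores / (1.0 + alpha * impression_freq)

def expand_profile(urm_row, uim_row, weight=0.3):
    # Impressions-as-user-profiles family (cf. [14], simplified):
    # add shown-but-not-interacted items to the profile with a low weight.
    seen_not_clicked = (uim_row > 0) & (urm_row == 0)
    return urm_row + weight * seen_not_clicked.astype(float)

# One user: scores from any base recommender over 6 items.
base_scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.7])
uim_row = np.array([0.0, 3.0, 0.0, 1.0, 0.0, 0.0])  # impression counts
urm_row = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # interactions

print(rerank_with_discounting(base_scores, uim_row))
print(expand_profile(urm_row, uim_row))
```

In the first family the base recommender stays fixed and only its output is re-scored; in the second the augmented profile is fed back into the base recommender as if it were interaction data.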
3. Experimental Methodology

This work presents several experiments on impressions recommenders, particularly when used as a plug-in to existing recommendation models, i.e., the impressions recommenders alter the preference scores of the recommendation models. The goal of these experiments is two-fold: first, to determine the recommendation quality of impressions recommenders on open-source impressions datasets; second, to replicate, if possible, the progress achieved by impressions recommenders in their original works.

The experiments followed this methodology (sketches of the preprocessing, the NDCG metric, and the hyper-parameter search follow at the end of this section):

Datasets, Processing, and Splits: The three available open-source datasets with impressions were used in the experiments: ContentWise Impressions, MIND, and FINN.no Slates. The following processing was applied to all datasets: (i) data records were sorted in ascending order by their time attribute; (ii) duplicated user-item interactions were aggregated into a single one, keeping the data of the first interaction; (iii) interactions and impressions of users with fewer than three interactions were removed; (iv) the training, validation, and test splits were created following a traditional leave-last-interaction-out scheme.

Evaluation: All recommenders were evaluated on traditional accuracy and beyond-accuracy metrics [5] in the standard top-N recommendation scenario. Hyper-parameters were searched using Bayesian search with 16 random cases and 50 total cases, optimizing NDCG [5] on the validation set.

Baseline Recommenders: Neighborhood-based (Item KNN and User KNN) [5], graph-based (RP3beta) [20], auto-encoder (SLIM ElasticNet [21] and EASE R [22]), machine learning (PureSVD [23] and MF BPR [24]), and factorization machine (Light FM [25]) recommenders. The description of these recommenders, their hyper-parameters, and their ranges is found in [5].

Impressions Recommenders: Re-ranking (Cycling [3] and Impressions Discounting [1]) and impressions as user profiles recommenders (Item Weighted Profiles and User Weighted Profiles) [14].³

³ Due to space limitations, this work omits the list of hyper-parameters of impressions recommenders.
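As referenced above, a minimal sketch of processing steps (i)-(iv), assuming a pandas DataFrame with illustrative column names user_id, item_id, and timestamp:

```python
import pandas as pd

def preprocess_and_split(df, min_interactions=3):
    """Steps (i)-(iv): sort by time, deduplicate keeping the first
    interaction, drop users with too few interactions, then build
    leave-last-interaction-out splits."""
    df = df.sort_values("timestamp")                               # (i)
    df = df.drop_duplicates(["user_id", "item_id"], keep="first")  # (ii)
    counts = df.groupby("user_id")["item_id"].transform("size")
    df = df[counts >= min_interactions]                            # (iii)
    # (iv) per user: last interaction -> test, second-to-last ->
    # validation, the remaining interactions -> train.
    rank_from_end = df.groupby("user_id").cumcount(ascending=False)
    test = df[rank_from_end == 0]
    validation = df[rank_from_end == 1]
    train = df[rank_from_end >= 2]
    return train, validation, test
```

Under this split every remaining user contributes exactly one test and one validation interaction, matching the leave-last-interaction-out scheme.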
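A minimal sketch of the optimized metric, binary-relevance NDCG at a cutoff, consistent with the top-N scenario described above (the function name is illustrative):

```python
import numpy as np

def ndcg_at_n(ranked_items, relevant_items, n=20):
    """Binary-relevance NDCG@n: DCG of the recommended ranking divided
    by the DCG of the ideal ranking of the user's relevant items."""
    ranked = np.asarray(ranked_items)[:n]
    gains = np.isin(ranked, list(relevant_items)).astype(float)
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    dcg = float(np.sum(gains * discounts))
    ideal_hits = min(len(relevant_items), n)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single held-out item is ranked 3rd in the list.
print(ndcg_at_n([7, 2, 5, 9], {5}, n=20))  # 1/log2(4) = 0.5
```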
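Finally, the Bayesian search setup (16 random cases, 50 total cases, optimizing validation NDCG) can be sketched with scikit-optimize's gp_minimize. The search space and the train_and_evaluate_ndcg helper are hypothetical stand-ins for fitting a recommender and measuring validation NDCG, not the exact setup of this work.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def train_and_evaluate_ndcg(topk, shrink):
    # Hypothetical stand-in: fit e.g. an Item KNN with these
    # hyper-parameters and return NDCG on the validation split.
    # A synthetic smooth function keeps the sketch runnable.
    return 1.0 / (1.0 + abs(topk - 200) / 200.0 + abs(shrink - 50) / 50.0)

# Illustrative search space for an Item KNN recommender.
space = [Integer(5, 1000, name="topk"), Real(0.0, 1000.0, name="shrink")]

def objective(params):
    topk, shrink = params
    return -train_and_evaluate_ndcg(topk, shrink)  # gp_minimize minimizes

result = gp_minimize(objective, space,
                     n_calls=50,           # 50 total cases
                     n_initial_points=16,  # 16 random cases
                     random_state=42)
print("best hyper-parameters:", result.x, "best NDCG:", -result.fun)
```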
4. Results and Discussion

The accuracy and beyond-accuracy results of impressions recommenders varied by dataset, baseline recommender, and impressions recommender. All impressions recommenders achieved higher NDCG than the baselines on the FINN.no Slates dataset. On the other datasets, impressions recommenders achieved slightly higher NDCG than the baseline recommenders only in some cases. These cases are shown in Table 1, which reports the NDCG of the base and impressions recommenders on the MIND dataset.⁴

⁴ Recommenders were evaluated on more metrics. Due to space limitations, Table 1 only contains the results on NDCG.

Table 1: Top-20 ranking accuracy measured with NDCG of base and impressions recommenders on the MIND dataset. MF BPR, NMF, and PureSVD are folded recommenders [23]. Values marked with * indicate higher accuracy than the Baseline column. ID refers to Impressions Discounting using the frequency of impressions. IUP refers to impressions as user profiles. "x" means the case was not explored due to incompatibility of the base recommender and the impressions recommender. "-" means the explored cases yielded the same results.

                   Baseline   Cycling    ID         IUP
  Item KNN         0.00868    0.00693    0.00028    0.00012
  User KNN         0.00766    0.01797*   0.01118*   0.06681*
  MF BPR           0.00002    0.00680*   0.00424*   -
  NMF              0.00116    0.00797*   -          0.00098
  PureSVD          0.00010    0.00728*   -          0.00015*
  RP3beta          0.01643    0.00720    0.00015    0.00009
  SLIM ElasticNet  0.01493    0.00699    0.00060    0.00010
  Light FM         0.00160    0.00705*   0.00101    x

From the table, a notable case is the use of impressions as user profiles (IUP) with User KNN on the MIND dataset. This case obtained eight and four times higher NDCG than the base (User KNN) and the best (RP3beta) baseline recommender, respectively. Looking at each impressions recommender, Cycling achieved higher NDCG on the FINN.no Slates and MIND datasets, although, on the latter, this only occurred with matrix factorization and factorization machine recommenders. The Impressions Discounting, Item Weighted Profiles, and User Weighted Profiles recommenders did not show such consistent results. For instance, the former achieved higher NDCG than User KNN but lower NDCG than Item KNN on the MIND dataset.

Regarding the replicability of impressions recommenders, Cycling recommended less accurate but more diverse items on the ContentWise Impressions dataset. This result is aligned with the conclusions of [3], which performed experiments on a different dataset of the same domain. For Impressions Discounting, only the results on the FINN.no Slates dataset are aligned with the conclusions of [1]; however, in the reference article, the experimental methodology was based on error prediction (RMSE) instead of top-N recommendation. The remaining impressions recommenders could not be replicated due to the lack of replicability information.

Regarding the signals within impressions, the results varied mostly by dataset, while the recommenders did not play a major role. For the ContentWise Impressions dataset, impressions cannot be considered as positive or negative, as no recommender achieved substantially higher NDCG by treating impressions as positive or negative signals. For the MIND and FINN.no Slates datasets, impressions were considered as positive signals by most recommenders, while at the same time achieving higher NDCG than the base recommender.

References

[1] P. Lee, L. V. S. Lakshmanan, M. Tiwari, S. Shah, Modeling impression discounting in large-scale recommender systems, in: S. A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani (Eds.), The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014, ACM, 2014, pp. 1837–1846. URL: https://doi.org/10.1145/2623330.2623356. doi:10.1145/2623330.2623356.

[2] D. Zibriczky, A combination of simple models by forward predictor selection for job recommendation, in: Proceedings of the 2016 Recommender Systems Challenge, RecSys Challenge 2016, Boston, Massachusetts, USA, September 15, 2016, ACM, 2016, pp. 9:1–9:4. URL: https://doi.org/10.1145/2987538.2987548. doi:10.1145/2987538.2987548.

[3] Q. Zhao, G. Adomavicius, F. M. Harper, M. C. Willemsen, J. A. Konstan, Toward better interactions in recommender systems: Cycling and serpentining approaches for top-n item lists, in: C. P. Lee, S. E. Poltrock, L. Barkhuus, M. Borges, W. A. Kellogg (Eds.), Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW 2017, Portland, OR, USA, February 25 - March 1, 2017, ACM, 2017, pp. 1444–1453. URL: https://doi.org/10.1145/2998181.2998211. doi:10.1145/2998181.2998211.

[4] M. Aharon, Y. Kaplan, R. Levy, O. Somekh, A. Blanc, N. Eshel, A. Shahar, A. Singer, A. Zlotnik, Soft frequency capping for improved ad click prediction in Yahoo Gemini native, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, ACM, 2019, pp. 2793–2801. URL: https://doi.org/10.1145/3357384.3357801. doi:10.1145/3357384.3357801.

[5] M. F. Dacrema, S. Boglio, P. Cremonesi, D. Jannach, A troubling analysis of reproducibility and progress in recommender systems research, ACM Trans. Inf. Syst. 39 (2021) 20:1–20:49. URL: https://doi.org/10.1145/3434185. doi:10.1145/3434185.
[6] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: T. Bogers, A. Said, P. Brusilovsky, D. Tikk (Eds.), Proceedings of the 13th ACM Conference on Recommender Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019, ACM, 2019, pp. 101–109. URL: https://doi.org/10.1145/3298689.3347058. doi:10.1145/3298689.3347058.

[7] J. Lin, The neural hype and comparisons against weak baselines, SIGIR Forum 52 (2019) 40–51. doi:10.1145/3308774.3308781.

[8] J. Lin, The neural hype, justified! A recantation, SIGIR Forum 53 (2021) 88–93. doi:10.1145/3458553.3458563.

[9] W. Yang, K. Lu, P. Yang, J. Lin, Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 1129–1132. doi:10.1145/3331184.3331340.

[10] C. Zhang, X. Cheng, An ensemble method for job recommender systems, in: F. Abel, A. A. Benczúr, D. Kohlsdorf, M. A. Larson, R. Pálovics (Eds.), Proceedings of the 2016 Recommender Systems Challenge, RecSys Challenge 2016, Boston, Massachusetts, USA, September 15, 2016, ACM, 2016, pp. 2:1–2:4. URL: https://doi.org/10.1145/2987538.2987545. doi:10.1145/2987538.2987545.

[11] T. D. Pessemier, K. Vanhecke, L. Martens, A scalable, high-performance algorithm for hybrid job recommendations, in: Proceedings of the 2016 Recommender Systems Challenge, RecSys Challenge 2016, Boston, Massachusetts, USA, September 15, 2016, ACM, 2016, pp. 5:1–5:4. URL: https://doi.org/10.1145/2987538.2987539. doi:10.1145/2987538.2987539.

[12] M. Hristakeva, D. Kershaw, M. Rossetti, P. Knoth, B. Pettit, S. Vargas, K. Jack, Building recommender systems for scholarly information, in: Proceedings of the 1st Workshop on Scholarly Web Mining, SWM@WSDM 2017, Cambridge, United Kingdom, February 10, 2017, ACM, 2017, pp. 25–32. URL: https://doi.org/10.1145/3057148.3057152. doi:10.1145/3057148.3057152.

[13] D. Agarwal, B. Chen, R. Gupta, J. Hartman, Q. He, A. Iyer, S. Kolar, Y. Ma, P. Shivaswamy, A. Singh, L. Zhang, Activity ranking in LinkedIn feed, in: S. A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani (Eds.), The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014, ACM, 2014, pp. 1603–1612. URL: https://doi.org/10.1145/2623330.2623362. doi:10.1145/2623330.2623362.

[14] M. Polato, F. Aiolli, A preliminary study on a recommender system for the job recommendation challenge, in: Proceedings of the 2016 Recommender Systems Challenge, RecSys Challenge 2016, Boston, Massachusetts, USA, September 15, 2016, ACM, 2016, pp. 1:1–1:4. URL: https://doi.org/10.1145/2987538.2987549. doi:10.1145/2987538.2987549.

[15] F. Abel, A. A. Benczúr, D. Kohlsdorf, M. A. Larson, R. Pálovics, RecSys Challenge 2016: Job recommendations, in: S. Sen, W. Geyer, J. Freyne, P. Castells (Eds.), Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016, ACM, 2016, pp. 425–426. URL: https://doi.org/10.1145/2959100.2959207. doi:10.1145/2959100.2959207.

[16] P. Knees, Y. Deldjoo, F. B. Moghaddam, J. Adamczak, G.-P. Leyson, P. Monreal, RecSys Challenge 2019: Session-based hotel recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 570–571. URL: https://doi.org/10.1145/3298689.3346974. doi:10.1145/3298689.3346974.
[17] E. D'Amico, G. Gabbolini, D. Montesi, M. Moreschini, F. Parroni, F. Piccinini, A. Rossettini, A. R. Introito, C. Bernardis, M. F. Dacrema, Leveraging laziness, browsing-pattern aware stacked models for sequential accommodation learning to rank, in: P. Knees, Y. Deldjoo, F. B. Moghaddam, J. Adamczak, G. P. Leyson, P. Monreal (Eds.), Proceedings of the Workshop on ACM Recommender Systems Challenge, Copenhagen, Denmark, September 2019, ACM, 2019, pp. 7:1–7:5. URL: https://doi.org/10.1145/3359555.3359563. doi:10.1145/3359555.3359563.

[18] J. I. Honrado, O. Huarte, C. Jimenez, S. Ortega, J. R. Pérez-Agüera, J. Pérez-Iglesias, Á. Polo, G. Rodríguez, Jobandtalent at RecSys Challenge 2016, in: F. Abel, A. A. Benczúr, D. Kohlsdorf, M. A. Larson, R. Pálovics (Eds.), Proceedings of the 2016 Recommender Systems Challenge, RecSys Challenge 2016, Boston, Massachusetts, USA, September 15, 2016, ACM, 2016, pp. 3:1–3:5. URL: https://doi.org/10.1145/2987538.2987547. doi:10.1145/2987538.2987547.

[19] Q. Zhao, M. C. Willemsen, G. Adomavicius, F. M. Harper, J. A. Konstan, Interpreting user inaction in recommender systems, in: Proceedings of the 12th ACM Conference on Recommender Systems, ACM, Vancouver, British Columbia, Canada, 2018, pp. 40–48. URL: https://dl.acm.org/doi/10.1145/3240323.3240366. doi:10.1145/3240323.3240366.

[20] F. Christoffel, B. Paudel, C. Newell, A. Bernstein, Blockbusters and wallflowers: Accurate, diverse, and scalable recommendations with random walks, in: H. Werthner, M. Zanker, J. Golbeck, G. Semeraro (Eds.), Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, September 16-20, 2015, ACM, 2015, pp. 163–170. URL: https://doi.org/10.1145/2792838.2800180. doi:10.1145/2792838.2800180.

[21] X. Ning, G. Karypis, SLIM: Sparse linear methods for top-n recommender systems, in: D. J. Cook, J. Pei, W. Wang, O. R. Zaïane, X. Wu (Eds.), 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011, IEEE Computer Society, 2011, pp. 497–506. URL: https://doi.org/10.1109/ICDM.2011.134. doi:10.1109/ICDM.2011.134.

[22] H. Steck, Embarrassingly shallow autoencoders for sparse data, in: L. Liu, R. W. White, A. Mantrach, F. Silvestri, J. J. McAuley, R. Baeza-Yates, L. Zia (Eds.), The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 3251–3257. URL: https://doi.org/10.1145/3308558.3313710. doi:10.1145/3308558.3313710.

[23] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: X. Amatriain, M. Torrens, P. Resnick, M. Zanker (Eds.), Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, ACM, 2010, pp. 39–46. URL: https://doi.org/10.1145/1864708.1864721. doi:10.1145/1864708.1864721.

[24] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: J. A. Bilmes, A. Y. Ng (Eds.), UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, AUAI Press, 2009, pp. 452–461. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1630&proceeding_id=25.
[25] M. Kula, Metadata embeddings for user and item cold-start recommendations, in: T. Bogers, M. Koolen (Eds.), Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, September 16-20, 2015, volume 1448 of CEUR Workshop Proceedings, CEUR-WS.org, 2015, pp. 14–21. URL: http://ceur-ws.org/Vol-1448/paper4.pdf.