=Paper=
{{Paper
|id=Vol-2431/paper6
|storemode=property
|title=Data Pruning in Recommender Systems Research: Best-Practice or Malpractice?
|pdfUrl=https://ceur-ws.org/Vol-2431/paper6.pdf
|volume=Vol-2431
|authors=Joeran Beel,Victor Brunel
|dblpUrl=https://dblp.org/rec/conf/recsys/BeelB19
}}
==Data Pruning in Recommender Systems Research: Best-Practice or Malpractice?==
Joeran Beel* (Trinity College Dublin, School of Computer Science & Statistics, ADAPT Centre, Dublin, Ireland, beelj@tcd.ie) and Victor Brunel (Polytech Clermont-Ferrand, Department of Mathematical Engineering and Modeling, Clermont-Ferrand, France, victor.brunel@etu.uca.fr)

* This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 13/RC/2106 and funding from the European Union and Enterprise Ireland under Grant Number CF 2017 0303-1.

ACM RecSys 2019 Late-breaking Results, 16th-20th September 2019, Copenhagen, Denmark. Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Many recommender-system datasets are pruned, i.e. some data is removed that would not be removed in a production recommender system. For instance, MovieLens contains only data from users who rated 20 or more movies (footnote 1). Similarly, some researchers prune data themselves and conduct experiments only on subsets of the original data, sometimes as little as 0.58%. We conduct a study on data pruning and find that 48% of researchers used pruned datasets. MovieLens was the most used dataset (40%) and can be considered a de-facto standard dataset. Based on MovieLens, we find that removing users with less than 20 ratings is equivalent to removing 5% of ratings and 42% of users. Ignoring these users may not be ideal, as users with less than 20 ratings have an RMSE of 1.03 on average, i.e. 23% worse than users with 20+ ratings (0.84). We discuss the results and conclude that pruning should be avoided, if possible, though more discussion in the community is needed.

KEYWORDS: Recommender Systems, Datasets, Pruning, Data Pruning, Evaluation

Figure 1: Distribution of users and ratings in the MovieLens dataset. 16% of users rated less than 10 movies, 26% rated between 10 and 19 movies. In most MovieLens releases (100k, 1m, 10m, ...), these 42% of users and their ratings are not included. 3% of users have 500 or more ratings and contribute 28% of all ratings in MovieLens.

Footnote 1: The 'MovieLens Latest Full' dataset is not pruned. All other variations of MovieLens are pruned (100k, 1m, 10m, 20m, and 'MovieLens Latest Small'). However, the 'MovieLens Latest Full' dataset is not recommended for research as it changes over time.

INTRODUCTION

'Data pruning' is common practice in recommender-systems research. We define data pruning as the removal of instances from a dataset that would not be removed in the real world, i.e. when the data is used by recommender systems in production environments. Reasons to prune datasets are manifold and include the interests of users when data is published (e.g. data privacy) or business interests. 'Data pruning' differs from 'data cleaning' in that data cleaning (e.g. outlier removal) is typically a prerequisite for the effective training of recommender-system and machine-learning algorithms, whereas data pruning does not affect the algorithms' ability to learn in itself. A prominent example is MovieLens in most of its variations (footnote 1) [3].
MovieLens contains information about how users of MovieLens.org rated movies, but the MovieLens team decided to exclude ratings of users who rated less than 20 movies (footnote 1). The reasoning was as follows (footnote 2): "(1) [researchers] needed enough ratings to evaluate algorithms, since most studies needed a mix of training and test data, and it is always possible to use a subset of data when you want to study low-rating cases; and (2) the movies receiving the first ratings for users during most of MovieLens' history are biased based on whatever algorithm was in place for new-user startup (for most of the site's life, that was a mix of popularity and entropy), hence the MovieLens team didn't want to include users who hadn't gotten past the 'start-up' stage [...]."

Footnote 2: Quote from our email exchange with Joseph A. Konstan, one of the MovieLens founders. Permission to quote from the email was granted.

Not only the creators of datasets may prune data; individual researchers may do so as well. For instance, Caragea et al. pruned the CiteSeer corpus for their research [2]. The corpus contains a large number of research articles and their citations. Caragea et al. removed research papers with fewer than ten and more than 100 citations, as well as papers citing fewer than 15 and more than 50 research papers. Of the originally 1.3 million papers in the corpus, around 16,000 remained (1.2%). Similarly, Pennock et al. removed many documents so that only 0.58% remained for their research [5].

We criticized the practice of data pruning previously, particularly when only a fraction of the original data remains [1]. We argued that evaluations based on a small fraction of the original data are of little significance. For instance, knowing that an algorithm performs well for 0.58% of users is of little significance if it remains unknown how the algorithm performs for the remaining 99.42%. Also, it is well known that collaborative filtering tends to perform poorly for users with few ratings [4]. Hence, when evaluating collaborative filtering algorithms, we would consider it crucial not to ignore users with few ratings, i.e. those users for whom the algorithms presumably perform poorly.

Our criticism was based more on 'gut feeling' than on scientific evidence. To the best of our knowledge, no empirical data exists on how widely data pruning is applied and how pruning affects recommender-systems evaluations. Also, the RecSys community has not discussed if, and to what extent, a) datasets should be pruned by their creators and b) individual researchers should prune data themselves. We conduct the first steps towards answering these questions. Ultimately, we hope to stimulate a discussion that leads to widely accepted guidelines on pruning data for recommender-systems research.

Figure 2: RMSE of six collaborative filtering algorithms, and the overall average of all algorithms, for the three data splits. For users with 1-19 ratings, RMSE is 1.03, compared to 0.84 (18% difference) for users with 20 and more ratings. The ranking of the algorithms is the same on all three data splits (SVD++ performs best, SVD second best, ..., Co-Clustering worst).

METHODOLOGY

To identify how widespread data pruning is, we analyzed all 112 full and short papers published at the ACM Conference on Recommender Systems (RecSys) 2017 and 2018.
Of these, 88 papers (79%) used offline datasets; the remaining 24 papers (21%) conducted, for example, user studies or online evaluations. For the 88 papers, we analyzed which datasets the authors used, whether the datasets were pruned by their original creators, and whether the authors conducted pruning themselves. To identify the latter, we read the Methodology sections of the manuscripts, or similarly named sections (footnote 3). This analysis was done by a single person rather quickly. Consequently, the reported numbers should be seen as ballpark figures.

Footnote 3: We might have missed some relevant information on pruning if that information was provided in a section other than the Methodology section.

To identify the effect of data pruning on recommender-system evaluations, we run six collaborative filtering algorithms from the Surprise library, namely SVD, SVD++, NMF, Slope One, Co-Clustering, and the Baseline Estimator. We use the unpruned 'MovieLens Latest Full' dataset (Sept. 2018), which contains 27 million ratings by 280,000 users, including data from users with less than 20 ratings. Due to computing constraints, we use a random sample with 6,962,757 ratings made by 70,807 users (footnote 4). We run the six algorithms on three subsets of the dataset: a) the entire unpruned dataset, b) the data that would be included in a 'normal' version of MovieLens (users with 20+ ratings), and c) the data that would normally not be included in the MovieLens dataset (users with less than 20 ratings). We compare how the algorithms perform on these different sets and measure their performance by Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). As the two metrics led to almost identical results, we only report RMSE. Our source code and our analysis of the manuscripts are available at https://github.com/BeelGroup/recsys-dataset-pruning.

Footnote 4: For users with only 1 rating, 'Surprise' uses a special technique for evaluations. Please also note that using a random sample is not an example of data pruning.
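To make the setup above concrete, the evaluation pipeline can be sketched roughly as follows. This is a minimal illustrative sketch under assumptions, not the authors' actual code (that code is available in the repository linked above): the file name ratings.csv, the column names, and the 5-fold cross-validation protocol are assumptions rather than details stated in the paper.

```python
# Sketch of the evaluation described above (assumptions noted; the authors' original
# code lives at https://github.com/BeelGroup/recsys-dataset-pruning).
# Assumes the 'MovieLens Latest Full' ratings are available locally as ratings.csv
# with columns userId, movieId, rating (hypothetical file name).
import pandas as pd
from surprise import (BaselineOnly, CoClustering, Dataset, NMF, Reader,
                      SlopeOne, SVD, SVDpp)
from surprise.model_selection import cross_validate

ratings = pd.read_csv("ratings.csv")

# Number of ratings per user, used to build the three splits: the unpruned data,
# users with 20+ ratings (a 'normal' MovieLens release), and users with <20 ratings.
counts = ratings.groupby("userId")["rating"].transform("count")
splits = {
    "unpruned": ratings,
    "users_20_plus": ratings[counts >= 20],
    "users_below_20": ratings[counts < 20],
}

algorithms = {
    "SVD": SVD(), "SVD++": SVDpp(), "NMF": NMF(), "SlopeOne": SlopeOne(),
    "CoClustering": CoClustering(), "Baseline": BaselineOnly(),
}

reader = Reader(rating_scale=(0.5, 5.0))  # MovieLens Latest uses half-star ratings
for split_name, df in splits.items():
    data = Dataset.load_from_df(df[["userId", "movieId", "rating"]], reader)
    for algo_name, algo in algorithms.items():
        # 5-fold cross-validation is an assumption; the paper does not state the protocol.
        result = cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=False)
        print(split_name, algo_name, round(result["test_rmse"].mean(), 3))
```

The per-split averages printed this way correspond to the kind of comparison reported in Figure 2, up to differences in the exact evaluation protocol.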
RESULTS

Popularity of (Pruned) Recommender-Systems Datasets

The authors of the 88 papers used a total of 64 unique datasets, whereby we counted different variations of MovieLens, Amazon, and Yahoo! as the same dataset. Our analysis empirically confirms what is already common wisdom in the recommender-systems community: MovieLens is the de-facto standard dataset in recommender-systems research. 40% of the full and short papers at RecSys 2017 and 2018 used the MovieLens dataset in at least one of its variations (Figure 3). The second most popular dataset is Amazon, which was used by 35% of all authors. Other popular datasets are shown in Figure 3 and include Yelp (13%), Tripadvisor (8%), Yahoo! (6%), BookCrossing (5%), Epinions (5%), and LastFM (5%). 11% of all researchers used a proprietary dataset, and 2% used a synthetic dataset.

Figure 3: Most popular recommender-system datasets used by the authors of 88 full and short papers at the ACM Conference on Recommender Systems (RecSys) 2017 & 2018.

50% of the authors conducted research with a single dataset, 31% used two datasets, and only 2% used six or more datasets (Figure 4). The highest number of datasets used was 7. On average, researchers used 1.88 datasets. 40% of the authors used a pruned dataset, and 15% pruned data themselves. In total, 48% of all authors conducted research at least partially with pruned data.

Figure 4: Number of datasets used by authors at the ACM Conference on Recommender Systems. 50% of authors used just one dataset, 31% used two datasets, and 18% used three or more datasets.

The Effect of Data Pruning

The user-rating distribution in MovieLens follows a long-tail distribution (Figure 1). 42% of the users have less than 20 ratings, and these users contribute 5% of all ratings. The remaining 58% of users, with 20+ ratings, contribute 95% of all ratings. The top 3% of users (those with 500+ ratings) contribute 28% of all ratings in the dataset. Consequently, using a pruned MovieLens variation (100k, 1m, ...) is equivalent to ignoring around 5% of the ratings and 42% of the users (footnote 5).

Footnote 5: The actual numbers in the pruned MovieLens versions may differ somewhat, given that we used a sample and that the different MovieLens versions include data from different time periods.

There are notable differences between the three data splits in terms of algorithm performance (Figure 2). Over the entire unpruned data, the RMSE of the six algorithms is 0.86 on average, with the best algorithm being SVD++ (0.80), closely followed by SVD (0.81). The worst performing algorithm is Co-Clustering (0.90). For the subset of ratings from users with 20+ ratings (which equals a 'normal' MovieLens dataset), the RMSE over all algorithms is 0.84 on average (2.12% lower, i.e. better). In other words, using a pruned version of MovieLens will lead, on average, to a 2.12% better RMSE compared to using the unpruned data. But, to make this clear, the algorithms do not actually perform 2.12% better. The results only appear to be better because data for which the algorithms tend to perform poorly was excluded from the evaluation. The ranking of the algorithms remains the same when comparing the pruned with the unpruned data (SVD++ performs best, followed by SVD, and Co-Clustering performs worst).

We also looked at the users grouped by the number of ratings per user (footnote 6). Figure 5 shows the RMSE for users with 1-9 ratings, 10-19 ratings, ..., 500+ ratings. RMSE constantly improves (i.e. decreases) the more ratings users have. On average, the six algorithms achieve an RMSE of 1.03 for users with less than 20 ratings (1.07 for users with <=9 ratings; 1.02 for users with 10-19 ratings). This contrasts with an average RMSE of 0.84 for users with 20+ ratings. In other words, the RMSE for users in a pruned MovieLens dataset is 23% better than the RMSE for the excluded users. For SVD and SVD++, the best performing algorithms, this effect is even stronger (+27% for SVD; +25% for SVD++).

Footnote 6: It is actually not trivial to decide how to split the data in this case. Follow-up research is needed to confirm the numbers and to investigate different options.
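The per-group analysis behind Figure 5 can be sketched in a similar way: train on a split, then bucket the test predictions by how many ratings each user has in the full dataset. This is again only an illustrative sketch under assumptions, not the authors' implementation: the single train/test split, the SVD-only example, and the intermediate bucket boundaries (the paper only states 1-9, 10-19, ..., 500+) are assumptions, and footnote 6 notes that the choice of split is itself non-trivial.

```python
# Sketch (assumptions, not the authors' implementation): RMSE per user group,
# where users are bucketed by their total number of ratings, as in Figure 5.
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split

ratings = pd.read_csv("ratings.csv")  # hypothetical file, columns userId, movieId, rating
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]],
                            Reader(rating_scale=(0.5, 5.0)))
trainset, testset = train_test_split(data, test_size=0.2, random_state=0)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)  # list of (uid, iid, true rating, estimate, details)

# Ratings per user in the full dataset determine each user's bucket.
n_ratings = ratings.groupby("userId").size()

# Bucket edges are illustrative; the paper only states 1-9, 10-19, ..., 500+ explicitly.
bins = [0, 9, 19, 49, 99, 199, 499, float("inf")]
labels = ["1-9", "10-19", "20-49", "50-99", "100-199", "200-499", "500+"]

pred_df = pd.DataFrame({
    "userId": [p.uid for p in predictions],
    "sq_err": [(p.r_ui - p.est) ** 2 for p in predictions],
})
pred_df["bucket"] = pd.cut(pred_df["userId"].map(n_ratings), bins=bins, labels=labels)

# RMSE per bucket = square root of the mean squared error within the bucket.
rmse_per_bucket = pred_df.groupby("bucket")["sq_err"].mean() ** 0.5
print(rmse_per_bucket)
```

Averaging such per-bucket values over all six algorithms would yield the kind of curve summarized in Figure 5.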
DISCUSSION & FUTURE WORK

Data pruning is a widespread phenomenon, with 48% of the short and full papers at RecSys 2017/2018 being based at least partially on pruned data. MovieLens nicely illustrates an issue that probably applies to many datasets with user-rating data. In the pruned MovieLens datasets, the number of removed ratings is rather small (5%). However, these 5% of ratings were made by 42% of the users. For researchers focusing on how well individual ratings can be predicted, the removed data probably has little impact. For researchers who focus on user-specific issues, ignoring 42% of users is probably not ideal, particularly as their RMSE is 23% worse than the RMSE of the users with 20+ ratings.

When discussing data pruning, probably the most important question is whether pruning changes the ranking of the evaluated algorithms. Other research has already shown that the ranking may change, though that research was not conducted in the context of data pruning [4] (footnote 7). In our study, the ranking of algorithms did not change. The algorithm that was best (second best, ...) on the pruned MovieLens data was also best (second best, ...) on the unpruned data. However, it seems likely to us that rankings may change if more diverse algorithms are compared, e.g. collaborative filtering vs. content-based filtering. Also, the MovieLens dataset is only moderately pruned. We consider it likely that heavy pruning, where only a small fraction of the data remains (e.g. Pennock et al. [5]), might lead to a change in the ranking of algorithms. More research with more diverse algorithms, different datasets, and different degrees of pruning is needed to confirm or reject this assumption. A qualitative study could help to identify details on the motivation of dataset creators and researchers to prune data.

Footnote 7: The authors evaluated different algorithms. Different algorithms performed best for users with few ratings (<8) than for users with more ratings.

Given the current results, we propose that data pruning should be applied with great care or, if possible, be avoided. We would not generally consider pruning a malpractice, but certainly not a best practice either. In some cases, especially when large parts of the data are removed, data pruning may become a malpractice, though the community has yet to determine how much removed data is too much. As a starting point, we would recommend the following guidelines, though this is certainly not a definite recommendation, and more discussion in the community is needed:

(1) Publishers of datasets should avoid pruning, if possible. If there are compelling reasons to prune data (e.g. ensuring privacy), these should be clearly communicated in the documentation.

(2) Researchers using pruned datasets should discuss the implications in their manuscript.

(3) Individual researchers should not prune data. If they feel that an algorithm may perform well on a subset of the data, they should report performance for both the entire dataset and the subset. If researchers conduct pruning anyway, they should clearly indicate this in the manuscript, provide reasoning, and discuss the implications.

It may not always be obvious where data cleaning ends, where legitimate data pruning begins, and when data pruning becomes a malpractice. The community certainly needs more discussion about this issue. We are confident that with widely agreed guidelines on data pruning, recommender-systems research will become more reproducible, more comparable, and more representative of 'the real world'.
Figure 5: RMSE of the six collaborative filtering algorithms and the overall average, by users' number of ratings. Algorithms applied to users with 1-9 ratings achieve an RMSE of 1.07 on average. Algorithms applied to users with 500+ ratings achieve an RMSE of 0.82 on average (37% better).

REFERENCES

[1] Joeran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. 2016. Research Paper Recommender Systems: A Literature Survey. International Journal on Digital Libraries 4 (2016), 305–338. https://doi.org/10.1007/s00799-015-0156-0

[2] Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra, and C. Lee Giles. 2013. Can't See the Forest for the Trees? A Citation Recommendation System. In iConference. 849–851. https://doi.org/10.9776/13434

[3] F. Maxwell Harper and Joseph A. Konstan. 2016. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2016), 19.

[4] Daniel Kluver and Joseph A. Konstan. 2014. Evaluating Recommender Behavior for New Users. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 121–128.

[5] David M. Pennock, Eric Horvitz, Steve Lawrence, and C. Lee Giles. 2000. Collaborative Filtering by Personality Diagnosis: A Hybrid Memory- and Model-Based Approach. In Sixteenth Conference on Uncertainty in Artificial Intelligence. 473–480.