=Paper=
{{Paper
|id=Vol-2450/short2
|storemode=property
|title=Visualizing Ratings in Recommender System Datasets
|pdfUrl=https://ceur-ws.org/Vol-2450/short2.pdf
|volume=Vol-2450
|authors=Diego Monti,Giuseppe Rizzo,Maurizio Morisio
|dblpUrl=https://dblp.org/rec/conf/recsys/Monti0M19
}}
==Visualizing Ratings in Recommender System Datasets==
Diego Monti (Politecnico di Torino, diego.monti@polito.it), Giuseppe Rizzo (LINKS Foundation, giuseppe.rizzo@linksfoundation.com), Maurizio Morisio (Politecnico di Torino, maurizio.morisio@polito.it)

ABSTRACT

The numerical outcome of an offline experiment involving different recommender systems should be interpreted by also considering the main characteristics of the available rating datasets. However, the existing metrics usually exploited for comparing such datasets, like sparsity and entropy, are not informative enough for reliably understanding all their peculiarities. In this paper, we propose a qualitative approach for visualizing different collections of user ratings in an intuitive and comprehensible way, independently from a specific recommendation algorithm. Thanks to graphical summaries of the training data, it is possible to better understand the behaviour of different recommender systems exploiting a given dataset. Furthermore, we introduce RS-viz, a Web-based tool that implements the described method and that can easily create an interactive 3D scatter plot starting from any collection of user ratings. We compared the results obtained during an offline evaluation campaign with the corresponding visualizations generated from the HetRec LastFM dataset in order to validate the effectiveness of the proposed approach.

KEYWORDS

Visualization Tool, Rating Dataset, Offline Evaluation

1 INTRODUCTION

Being able to correctly interpret the results obtained during an offline evaluation of different recommender systems is of paramount importance for understanding the quality of the suggested items [5]. However, this task is particularly difficult, as it requires knowing several details regarding the evaluation protocol and the rating dataset exploited for conducting the experiments [9]. For example, sparse datasets usually yield lower evaluation scores with respect to denser datasets [3]. On the other hand, datasets with many popular items tend to favor systems that create less diverse suggestions [10], like the most popular baseline. There are also some subtle differences among rating datasets, related to the application domain or the collection protocol, that could affect the choice of the most appropriate recommender system.

Different metrics have been proposed in the literature to summarize the main characteristics of a rating dataset, e.g. sparsity or entropy. However, we argue that such metrics are not sufficient for comparing datasets in a reliable way, as many other facets should be taken into account. For example, it is not possible to understand the rating behaviors of specific groups of users, nor the popularity of the most rated items, by only looking at some general statistics computed on the whole dataset.

A possible solution to this problem could be represented by data visualization techniques [7]. However, most of the methods available in the literature are designed to display the output of a recommendation model and not the original dataset [1, 6]. In contrast, we argue that it is necessary to visually explore a rating dataset even before it is used to train a recommender system, in order to understand how the input data will influence the outputs under analysis.

In this paper, we propose a novel qualitative approach based on data visualization for creating a graphical summary of any collection of user preferences. This method is useful for visually identifying similarities and differences among the available datasets. In fact, we argue that if two datasets result in similar visualizations, the behavior of different recommender systems relying on them will be consistent. Furthermore, we present a Web-based tool, named RS-viz, for easily constructing the proposed visualization and comparing rating datasets in an intuitive way. RS-viz is freely available on GitHub¹ and its usage is described in an introductory video.²

Differently from the plotting capabilities already available in specialized software like Matlab or Scilab, our approach is more general, as it can be applied in a consistent way by different users on any dataset, and it can be exploited on many devices without the need of installing specific tools.

The remainder of this paper is structured as follows: in Section 2, we review existing visualization techniques in the context of rating datasets, while in Section 3 we present the approach used to construct the scatter plot and describe the implementation details of the Web-based tool RS-viz. In Section 4, we comment on the outcome of an evaluation campaign designed to validate the proposed method. Finally, in Section 5, we provide the conclusions and outline possible future work.

2 RELATED WORK

Different authors have proposed to create interactive visualizations for qualitatively evaluating the goodness of the recommended items or for helping users to identify the most relevant suggestions.

For example, Kunkel et al. [7] created a 3D map-based visualization that represents the preferences of a user on the entire space of items. The user can inspect the profile created by the recommender and also manually modify it, if necessary.

Çoba et al. [2] extended the rrecsys library by adding graphical capabilities for performing an offline visual evaluation of different recommendation approaches with respect to the popularity of the suggested items.

Gil et al. [6] introduced VisualRS, a tool capable of creating tree graph structures for exploring the most important relationships between items or users.
The graph-based visualization is useful for comparing the results of different recommendation approaches and selecting the most appropriate one for a given task.

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS ’19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 19 Sept 2019, Copenhagen, DK.

¹ https://github.com/D2KLab/rs-viz
² https://doi.org/10.6084/m9.figshare.8197706

In contrast, Cardoso et al. [1] proposed to combine the output of different recommender systems with human-generated data to allow users to explore the suggested items in an effective way. This method could also be exploited to compare the results of different recommender systems in a qualitative way.

All the reviewed approaches are based on popular recommender systems. To the best of our knowledge, this paper represents the first formal attempt to visualize the ratings available in an offline dataset independently from a specific recommender system.

3 VISUALIZATION APPROACH

In this section, we first describe the algorithm that we devised for creating a scatter plot that represents a rating dataset (Section 3.1), then we introduce the implementation details of RS-viz (Section 3.2).

3.1 Scatter plot construction

In order to visually represent the rating matrix associated with a generic dataset, we opted for a 3D scatter plot. The rationale behind this choice is that each point in the visualization can intuitively represent a single rating from the dataset: the value of the x-axis is the identifier of the user, the value of the y-axis is the identifier of the item, while the value of the z-axis is the rating itself, if it is expressed on a numerical scale.

Figure 1: The MovieLens 100K dataset.
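This per-rating mapping can be sketched in a few lines (an illustrative fragment of ours, not the RS-viz code), turning each (user, item, rating) triple into the coordinates of one point:

```python
# Toy (user, item, rating) triples; x = user id, y = item id, z = rating.
ratings = [(1, 10, 4.0), (1, 20, 2.5), (2, 10, 5.0), (3, 30, 3.5)]

# Split the triples into one coordinate list per axis.
xs, ys, zs = (list(axis) for axis in zip(*ratings))

# xs, ys and zs can then be fed to any 3D scatter API,
# e.g. matplotlib's Axes3D.scatter(xs, ys, zs).
print(xs, ys, zs)
```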
However, it is easy to foresee that this approach cannot handle complex datasets with many preferences, as it requires one point for each rating. Furthermore, if the available ratings are only binary, a traditional scatter plot would suffice.

For these reasons, we decided to create a more compact representation of the rating matrix before visualizing it. In detail, we first associated the users and the items with internal numerical identifiers according to their frequency of appearance in the dataset. For example, we associated the most rated item with the value of 1, and the second most rated item with the value of 2. The same approach was followed for ordering the identifiers of the users according to the number of ratings that they expressed. Then, we linearly normalized such identifiers within an interval ranging from 0 to a user-provided value, which represents the size of a squared rating matrix in a transformed space. Finally, we binarized the ratings from the original dataset according to a user-provided threshold and we counted, for each cell of the transformed matrix, the number of positive preferences associated with that cell.

For example, if user 40 expressed a preference for item 360 in a dataset where the number of users is 941, the number of items is 1446, and the number of normalized users and items is equal to 100, that rating would be associated with the cell (4, 24), because ⌊40 ÷ 941 × 100⌋ = 4 and ⌊360 ÷ 1446 × 100⌋ = 24.

Therefore, the value of the z-axis represents the number of positive ratings associated with a sub-matrix of the original dataset, sorted by item popularity and user activity. In order to enhance the readability of the visualization, we also represented the value of the z-axis using a logarithmic color scale.

Figure 2: The MovieLens 1M dataset.
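The compacting transformation above can be sketched as follows (a hypothetical Python re-implementation with our own function names, not the actual RS-viz source); `cell` reproduces the worked example from the text, while `compact_matrix` builds the whole transformed matrix from (user, item, rating) triples:

```python
import math
from collections import Counter

def cell(user_rank, item_rank, n_users, n_items, size=100):
    """Map frequency-based user/item identifiers to a cell of the
    size x size transformed matrix, as in the worked example."""
    return (math.floor(user_rank / n_users * size),
            math.floor(item_rank / n_items * size))

def compact_matrix(ratings, size=100, threshold=3.0):
    """Count, per cell, the positive ratings (>= threshold) after
    re-indexing users by activity and items by popularity."""
    # Most active user / most rated item receives identifier 1, and so on.
    user_rank = {u: k + 1 for k, (u, _) in
                 enumerate(Counter(u for u, _, _ in ratings).most_common())}
    item_rank = {i: k + 1 for k, (i, _) in
                 enumerate(Counter(i for _, i, _ in ratings).most_common())}
    matrix = [[0] * size for _ in range(size)]
    for u, i, r in ratings:
        if r >= threshold:  # binarize by the user-provided threshold
            x, y = cell(user_rank[u], item_rank[i],
                        len(user_rank), len(item_rank), size)
            matrix[min(x, size - 1)][min(y, size - 1)] += 1
    return matrix

# The example from the text: user rank 40 of 941, item rank 360 of 1446.
print(cell(40, 360, 941, 1446))  # (4, 24)
```

The logarithmic color scale described above would then be applied only at rendering time, on the counts stored in the returned matrix.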
As an example of the proposed method, we report in Figure 1 and Figure 2 the scatter plots obtained from the MovieLens 100K and MovieLens 1M datasets, when the rating threshold is equal to 3 and the number of normalized users and items is equal to 100.

By looking at the values of the z-axis, it is possible to observe in an intuitive way that MovieLens 1M contains a higher number of popular items and of very active users. This conclusion is consistent with the findings of other works that analyze the main characteristics of the MovieLens datasets [3].

3.2 Software implementation

We realized a software implementation of the proposed approach as a Web-based tool, called RS-viz, which is freely available. Our visualization framework has been developed using the JavaScript programming language and it runs entirely in the user’s browser. For this reason, it can also be exploited for analyzing private datasets, as no information about them is sent to remote servers.

The user needs to visit the Web page of RS-viz³ and select one of the built-in datasets or provide her own dataset as a CSV file. Then, she needs to specify the threshold between positive and negative ratings and the number of normalized users and items, which should be selected also considering the rating scale of the input dataset and the desired visualization density. A screenshot of the form containing the configuration parameters of RS-viz is reported in Figure 3.

Figure 3: The configuration parameters of RS-viz.

After a few seconds, an interactive 3D scatter plot is constructed on the right side of the page. The user can inspect the plot by rotating the camera and finally save the result as a PNG file.

4 EVALUATION CAMPAIGN

In the following, we report the numerical outcomes of an evaluation campaign conducted on the HetRec LastFM dataset using different recommendation approaches, with the purpose of understanding if our visualization technique is capable of capturing the different characteristics of a rating dataset and to what extent they influence the recommendation coverage and accuracy.

4.1 Experimental setup

We performed two different experiments with the HetRec LastFM dataset and our evaluation framework RecLab [8].⁴ In the first one, we set the rating threshold equal to 0, while in the second one we set it equal to 1,000. For the other parameters, we used the default values of the framework: we selected a random splitting protocol, a test set size of 20% of the dataset, and a length k of the recommended lists equal to 10. We considered different recommendation approaches, namely the most popular and random baselines and the MyMediaLite [4] implementations of the Item KNN, User KNN, BPRMF, and WRMF recommender systems, using their default settings. We computed the metrics of coverage, precision, recall, and NDCG. The results of these experiments are reported in Table 1.

The same datasets obtained from HetRec LastFM by varying the rating threshold were exploited for creating two scatter plots using RS-viz, as displayed in Figure 4.

4.2 Discussion

From the visualization provided in Figure 4a, we can observe that the HetRec LastFM dataset has a very different structure from the one of the MovieLens datasets. In fact, a limited number of items are associated with the preferences of almost all users, as can be deduced by considering only the ratings expressed for popular items, that is the ones with low identifiers. Please note that such ratings seem unrelated to the identifier of the user, resulting in a scatter plot that resembles the shape of a half cylinder.

Furthermore, less popular items seem to be liked by less active users. This behavior can be observed by looking at the lower part of Figure 4a. Users with a high identifier have rated a more widespread set of items, while users with a low identifier have rated popular items more frequently.

These differences can be easily explained if we consider the collection protocol and the domain of the dataset under analysis. The ratings in the LastFM dataset represent the number of times a user listened to a particular artist: therefore, they were collected in an implicit way and their values range from one to tens of thousands. The area of the plot with almost no preferences is also a direct result of the collection protocol, which relied on the LastFM website to obtain the top artists for a set of users. In fact, the list of artists available in the dataset is limited to 50 items for each user.

If we increase the value of the rating threshold, we can observe that the resulting scatter plot, represented in Figure 4b, is more similar to the ones of the MovieLens datasets, resulting in a very typical long-tail distribution with respect to both the items and the users. This outcome is due to the fact that we removed the ratings produced by more casual listeners.

From the numerical outcomes of the experiments, we can deduce that the User KNN and WRMF algorithms are the most appropriate ones with both rating thresholds. In general, all the available recommenders perform worse with a higher threshold. In fact, from the visualizations it is clear that the number of available preferences is much lower with respect to the MovieLens 100K dataset, as the scatter plot represented in Figure 4b is sparser than the one in Figure 1. Because user preferences are more limited in number and fragmented, the task of any recommender system is necessarily harder.

Interestingly, the Item KNN, differently from the User KNN, experienced a dramatic drop in all the metrics considered.
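The coverage and accuracy figures reported in Table 1 follow standard top-k definitions; the sketch below (our own illustrative code with hypothetical function names, not the RecLab implementation) shows how such metrics are typically computed:

```python
import math

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for i in recommended[:k] if i in relevant) / k

def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant items found in the top-k list."""
    return sum(1 for i in recommended[:k] if i in relevant) / len(relevant)

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: discounted hits over the ideal ranking."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, i in enumerate(recommended[:k]) if i in relevant)
    idcg = sum(1.0 / math.log2(pos + 2)
               for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0

def catalog_coverage(all_recommendations, catalog):
    """Fraction of the catalog appearing in at least one top-k list."""
    recommended = {i for recs in all_recommendations for i in recs}
    return len(recommended) / len(catalog)

# Toy check: two relevant items, one list of three recommendations.
print(precision_at_k([1, 2, 3], {1, 3}, k=3))  # 2 hits out of 3
```

In an offline campaign, the accuracy values are computed per test user and then averaged, while coverage is computed once over all recommendation lists.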
The drop of the Item KNN may have been caused by the fact that a very low number of users is available for each item of the dataset. This characteristic can also be observed in the generated scatter plot by looking at the lower part of Figure 4b: the white horizontal stripes denote groups of items that have been rated by only a few very active users.

³ http://datascience.ismb.it/rs-viz/
⁴ http://datascience.ismb.it/reclab/

Table 1: The numerical results of the experimental comparison using the HetRec LastFM dataset.

                 Rating threshold = 0                        Rating threshold = 1,000
Algorithm        Coverage   Precision  Recall     NDCG       Coverage   Precision  Recall     NDCG
Random           0.706679   0.000798   0.000745   0.000858   0.705562   0.000107   0.000622   0.000133
Most Popular     0.001692   0.071170   0.071480   0.079673   0.001684   0.022122   0.090233   0.027437
Item KNN         0.235321   0.129362   0.131967   0.145258   0.107233   0.002878   0.013012   0.002686
User KNN         0.030074   0.157234   0.160353   0.193121   0.049343   0.040672   0.160767   0.055013
BPRMF            0.022979   0.081277   0.082248   0.094737   0.003756   0.021695   0.088211   0.024366
WRMF             0.015558   0.159947   0.162332   0.195107   0.012886   0.039606   0.157484   0.053148

(a) Rating threshold = 0. (b) Rating threshold = 1,000.
Figure 4: The 3D scatter plots obtained using the HetRec LastFM dataset with different rating thresholds.

5 CONCLUSION AND FUTURE WORK

In this paper, we proposed a method for creating graphical summaries of any rating dataset, for the purpose of enabling researchers and practitioners to better interpret the results of an offline evaluation campaign. Furthermore, we introduced RS-viz, a Web-based tool capable of creating an interactive 3D scatter plot according to the aforementioned approach, starting from a user-provided CSV dataset or a built-in collection of ratings.

We validated the capability of such visualizations to reveal useful information by comparing the graphical representations of the HetRec LastFM dataset constructed with different rating thresholds with the numerical outcomes of two offline experiments involving various recommendation techniques.

As future work, we would like to quantitatively characterize rating datasets according to different dimensions and place them in various categories, for example by analyzing the diversity of user preferences or the tendency to rate popular items only. This empirical categorization would enable the users of our tool to better understand the available ratings and to select the most appropriate recommendation approach according to such properties.

Furthermore, we would like to improve RS-viz by developing other visualization methods to enable more comprehensive analyses, and to evaluate its effectiveness by checking if researchers and practitioners are able to correctly use it to explain the performance of different recommender systems on a particular dataset.

Finally, additional studies are needed to better understand how the proposed approach could be extended to also visualize non-conventional datasets, for example the ones enhanced with context-aware information like spatial and temporal data.

REFERENCES

[1] Bruno Cardoso, Gayane Sedrakyan, Francisco Gutiérrez, Denis Parra, Peter Brusilovsky, and Katrien Verbert. 2019. IntersectionExplorer, a multi-perspective approach for exploring recommendations. International Journal of Human-Computer Studies 121 (2019), 73–92. https://doi.org/10.1016/j.ijhcs.2018.04.008
[2] Ludovik Çoba, Panagiotis Symeonidis, and Markus Zanker. 2017. Visual Analysis of Recommendation Performance. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, USA, 362–363. https://doi.org/10.1145/3109859.3109982
[3] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys ’10). ACM, New York, NY, USA, 39–46. https://doi.org/10.1145/1864708.1864721
[4] Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. MyMediaLite: A Free Recommender System Library. In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys ’11). ACM, New York, NY, USA, 305–308. https://doi.org/10.1145/2043932.2043989
[5] Mouzhi Ge, Carla Delgado-Battenfeld, and Dietmar Jannach. 2010. Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys ’10). ACM, New York, NY, USA, 257–260. https://doi.org/10.1145/1864708.1864761
[6] Stephanie Gil, Jesús Bobadilla, Fernando Ortega, and Bo O. Zhu. 2018. VisualRS: Java framework for visualization of recommender systems information. Knowledge-Based Systems 155 (2018), 66–70. https://doi.org/10.1016/j.knosys.2018.04.028
[7] Johannes Kunkel, Benedikt Loepp, and Jürgen Ziegler. 2017. A 3D Item Space Visualization for Presenting and Manipulating User Preferences in Collaborative Filtering. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (IUI ’17). ACM, New York, NY, USA, 3–15. https://doi.org/10.1145/3025171.3025189
[8] Diego Monti, Giuseppe Rizzo, and Maurizio Morisio. 2018. A Distributed and Accountable Approach to Offline Recommender Systems Evaluation. In Proceedings of the Workshop on Offline Evaluation for Recommender Systems at the 12th ACM Conference on Recommender Systems (REVEAL 2018). Vancouver, BC, Canada, Article 6, 5 pages. https://arxiv.org/abs/1810.04957
[9] Alan Said and Alejandro Bellogín. 2014. Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys ’14). ACM, New York, NY, USA, 129–136. https://doi.org/10.1145/2645710.2645746
[10] Saúl Vargas and Pablo Castells. 2011. Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys ’11). ACM, New York, NY, USA, 109–116. https://doi.org/10.1145/2043932.2043955