How Similar is Rating Similarity to Content Similarity?

Osman Başkaya
Department of Computer Engineering, Bahçeşehir University, İstanbul, Turkey
osman.baskaya@computer.org

Tevfik Aytekin
Department of Computer Engineering, Bahçeşehir University, İstanbul, Turkey
tevfik.aytekin@bahcesehir.edu.tr

ABSTRACT
The success of a recommendation algorithm is typically measured by its ability to predict the rating values of items. Although accuracy in rating prediction is an important property of a recommendation algorithm, there are other properties that matter for user satisfaction. One such property is the diversity of recommendations: it has been recognized that being able to recommend a diverse set of items plays an important role in user satisfaction. One convenient approach to diversification is to use the rating patterns of items. However, in what sense the resulting lists will be diversified is not clear. In order to assess this, we explore the relationship between the rating similarity and the content similarity of items. We discuss the experimental results and the possible implications of our findings.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation, Measurement

Keywords
diversity, recommender systems, collaborative filtering

1. INTRODUCTION
Recommender systems help users pick items of interest based on explicit or implicit information that users provide to the system. One of the most successful and widely used techniques in recommender systems is collaborative filtering (CF) [7]. CF algorithms try to predict the ratings of a user based on the ratings of that user and the ratings of other users in the system. The performance of collaborative filtering algorithms is typically measured by the error they make in predicting the ratings of users for items. Although prediction accuracy is an important aspect of recommender systems, it is not the only one. Recently, increasing the diversity of recommendation lists has gained attention among researchers in the field [8, 2]. Being able to recommend a diverse set of items to a user is important for user satisfaction, because a recommendation list consisting of one type of item (e.g., movies only from the same genre) might not be very satisfactory even if the accuracy of rating prediction is high. But this raises an issue: we first need to define a metric for measuring the diversity of a recommendation list, and only then can we try to optimize the recommendation list with respect to that metric. One possible metric for measuring the diversity of the recommendation list of a particular user is described in [2]. This metric measures diversity as the average dissimilarity of all pairs of items in a user's recommendation list. Formally, it can be defined as follows:

    D(R) = \frac{1}{N(N-1)} \sum_{i \in R} \sum_{j \in R,\, j \neq i} d(i, j),        (1)

where R is the recommendation list of a user, N = |R|, and d(i, j) is the dissimilarity of items i and j, defined as one minus the similarity of items i and j.

We think that average dissimilarity is a reasonable way to measure the diversity of a list of items. The important question, however, is how to define d(i, j), i.e., the dissimilarity of two items, which is left unspecified in equation (1). The problem is not to choose a similarity metric such as Pearson or cosine. The problem is whether we can use the rating patterns (vectors) of items in order to measure their similarity, and, if we do use these rating patterns, in what respect the recommendation lists will be diversified. For example, in a movie recommender system, will the recommendation lists contain more movies from different genres, or will the content of the movies get diversified?

In order to answer these questions we compare rating similarity with two types of content similarity, which we define below. We hope that the results we discuss will shed some light on these questions and stimulate discussion on diversification.
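As an illustration of equation (1), the sketch below computes the intra-list diversity of a recommendation list from item feature vectors. This is not code from the paper; representing each item as a dense vector and taking d(i, j) = 1 - cosine similarity are our own assumptions for the example.

import numpy as np

def intra_list_diversity(item_vectors):
    """Average pairwise dissimilarity D(R) of a recommendation list,
    as in equation (1), with d(i, j) = 1 - cosine_similarity(i, j).

    item_vectors: array of shape (N, F), one row per recommended item
    (e.g., an item's rating vector or tf-idf vector)."""
    n = len(item_vectors)
    if n < 2:
        return 0.0
    # Cosine similarity between every pair of item vectors.
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    unit = item_vectors / np.clip(norms, 1e-12, None)
    dissim = 1.0 - unit @ unit.T
    # Sum over all ordered pairs with i != j, then normalize by N(N-1).
    np.fill_diagonal(dissim, 0.0)
    return dissim.sum() / (n * (n - 1))

# Usage with toy feature vectors for a 3-item list.
R = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 0.0, 1.0]])
print(intra_list_diversity(R))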
2. RELATED WORKS
In hybrid recommender systems, content information is used in order to increase the accuracy of rating predictions, especially for items whose ratings are too sparse. For example, [3, 5, 6] use content information collected from sources such as Wikipedia and IMDB in order to improve the accuracy of rating predictions. These works indirectly show that there is indeed some positive relationship between rating similarity and content similarity; otherwise, it would not be possible to increase the prediction accuracy using content information.

Another paper which comes close to our concerns is [1]. Here, the authors propose a new algorithm for diversifying recommendation lists. Their algorithm uses the rating patterns of movies for diversification, and the results are evaluated by looking at how well the recommendation lists are diversified with respect to the genres and movie series to which the movies belong. They report that the diversity of the resulting lists increases in both respects (genre and series). However, to the best of our knowledge there are no direct comparisons between rating and content similarity. In this paper we examine these two types of similarity directly.

3. ITEM CONTENT GENERATION
In our experiments we use the Movielens (1M) dataset (http://www.grouplens.org/node/73). In order to compare movies' rating patterns to their contents we first need to generate movie content information. We use two sources of information to this end. One source of content information comes from the Wikipedia articles corresponding to the movies in the Movielens dataset. The other source comes from the genre information provided in the dataset. The details of content generation are given below.

3.1 Content Generation from Wikipedia
The Movielens dataset contains 3883 distinct movies and 6040 users. Some of these movies are not rated by any user, and some have no corresponding entries in Wikipedia. After discarding these movies we are able to fetch 3417 movie articles (approximately 88% of all movies) from Wikipedia.

In this work we only use the text of each Wikipedia article (we do not use the link structure or category information of the articles). The text of a Wikipedia article consists of parts such as "Plot", "Cast", and "Release". We do not include the "References" and "See also" parts, since they may contain information that is unrelated to the content of the movies. After extracting the text of each document we apply some basic preprocessing steps such as stemming and stop-word removal. We use a vector space model to represent the text documents.
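The paper does not include code for this preprocessing step; the following sketch shows one plausible way to build the Movie-TFIDF representation described above, assuming the article texts (with the "References" and "See also" sections already removed) are available in a Python dict. The use of NLTK's Porter stemmer and scikit-learn's TfidfVectorizer is our choice, not the authors'.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(article_text):
    """Lowercase, drop English stop words, and stem the remaining tokens."""
    tokens = article_text.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

def build_movie_tfidf(articles):
    """articles: hypothetical dict mapping movie id -> cleaned article text.
    Returns the ordered movie ids and the Movie-TFIDF matrix (rows = movies)."""
    movie_ids = sorted(articles)
    corpus = [preprocess(articles[m]) for m in movie_ids]
    movie_tfidf = TfidfVectorizer().fit_transform(corpus)
    return movie_ids, movie_tfidf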
3.2 Genre Information
As a second source of content we use the genre keywords (such as adventure, action, comedy, etc.) provided by the Movielens dataset. Each movie in the dataset is associated with one or more genre keywords. We define the genre similarity between two movies using the Jaccard metric:

    J(i, j) = \frac{|G_i \cap G_j|}{|G_i \cup G_j|},        (2)

where G_i and G_j are the genre sets of items i and j.
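Equation (2) translates directly into code. In the sketch below the genre sets are assumed to be Python sets of genre keywords; the example genres are only illustrative.

def genre_similarity(genres_i, genres_j):
    """Jaccard similarity J(i, j) between two movies' genre sets (equation 2)."""
    if not genres_i and not genres_j:
        return 0.0
    return len(genres_i & genres_j) / len(genres_i | genres_j)

# Example with illustrative genre sets: shared {"Action"}, union of size 3.
print(genre_similarity({"Action", "Sci-Fi"}, {"Action", "Thriller"}))  # 1/3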
4. EXPERIMENTS
In the first set of experiments we try to understand the relation between movie rating patterns and the content generated from the corresponding Wikipedia articles. We have two matrices: the Movie-User matrix, which holds the ratings of users on movies, and the Movie-TFIDF matrix, which holds the tf-idf weights of each document. For evaluation we use the following methodology. For each movie we find the 100 most similar movies using the Movie-User matrix (the rating neighborhood) and the 100 most similar movies using the Movie-TFIDF matrix (the content neighborhood). We then count the number of common items in these two neighborhoods. It turns out that on average there are 14.74 common movies in the two neighborhoods. If we generate the neighborhoods randomly, this value is around 2.80. Randomization tests show that this difference is significant (p < 0.01).

We run the same experiment with different neighborhood sizes (20 and 50), but the percentages of common items in the rating and content neighborhoods turn out to be similar to those obtained with a neighborhood of size 100.
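A minimal sketch of this neighborhood-overlap methodology is given below. The cosine similarity used for both matrices, the dense matrix layout (one row per movie), and the simple random baseline are our assumptions; the sketch illustrates the procedure rather than reproducing the reported numbers.

import numpy as np

def top_k_neighbors(matrix, k=100):
    """For each row (movie), return the indices of the k most cosine-similar
    other rows. `matrix` has one row per movie (e.g., the Movie-User rating
    matrix or the Movie-TFIDF matrix as a dense array)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # a movie is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]  # top-k indices per movie

def average_overlap(movie_user, movie_tfidf, k=100):
    """Average number of common movies between each movie's rating
    neighborhood and its content neighborhood."""
    rating_nbrs = top_k_neighbors(movie_user, k)
    content_nbrs = top_k_neighbors(movie_tfidf, k)
    overlaps = [len(set(r) & set(c)) for r, c in zip(rating_nbrs, content_nbrs)]
    return float(np.mean(overlaps))

def random_baseline(n_movies, k=100, trials=100, seed=0):
    """Expected overlap when the two k-item neighborhoods are drawn at random."""
    rng = np.random.default_rng(seed)
    overlaps = []
    for _ in range(trials):
        a = rng.choice(n_movies, size=k, replace=False)
        b = rng.choice(n_movies, size=k, replace=False)
        overlaps.append(len(set(a) & set(b)))
    return float(np.mean(overlaps))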
We also test whether there is a relationship between the number of ratings a movie has and the correspondence between rating and content similarity. To see this we find the rating and content neighborhoods of movies that have a similar number of ratings. To do this we divide the movies into rating intervals according to the number of ratings they have: movies with between 1 and 100 ratings, between 101 and 200 ratings, and so on. If an interval has fewer than 20 movies, we merge it with the previous one in order to increase the significance of the results. Figure 1 shows the average number of common items in the rating and content neighborhood sets of movies as a function of rating intervals. Interestingly, Figure 1 shows a clear linear correlation: as the number of ratings increases, the number of common items in the content and rating neighborhoods of movies also increases.

[Figure 1: Average number of common movies as a function of rating intervals.]

One possible explanation of this positive linear correlation is the following. Generally, there is a positive relationship between the number of ratings and the popularity of a movie, which means that popular movies receive ratings from many different people with different tastes. Hence the rating patterns of popular movies reflect a diverse set of characteristics. Wikipedia movie articles also have rich contents reflecting different characteristics of movies. This might explain why a movie's rating neighborhood approaches its content neighborhood as the number of ratings increases.

In the next set of experiments our aim is to understand the relationship between movie rating patterns and the movie genres provided in the Movielens dataset. Genre keywords provide limited information compared to Wikipedia articles, because Wikipedia articles contain terms that give information not only about the genre of a movie but also about its director, cast, musical composition, etc.

In order to measure the relationship between movie rating patterns and genres we apply a similar methodology. For each movie m we find the 100 most similar movies using the Movie-User matrix (that is, the rating neighborhood) and compute the Jaccard similarity (as defined in equation 2) between movie m and the movies in its rating neighborhood. The average Jaccard similarity value turns out to be 0.43. If we generate the rating neighborhood randomly, we find a Jaccard value of around 0.17. Randomization tests show that this difference is significant (p < 0.01).

We also test whether there is a relationship between the number of ratings and genre similarity. Similar to the experiment described above, we divide the movies into rating intervals according to the number of ratings they have. Then, for each movie m in a rating interval, we calculate the Jaccard similarity between movie m and its rating neighborhood of 100 movies and compute the averages per rating interval. Figure 2 shows these average values as a function of rating intervals. Here we again have an interesting case: there is a negative linear correlation, which means that the more ratings a movie has, the more its rating similarity diverges from its genre similarity.

[Figure 2: Average Jaccard index as a function of rating intervals.]

The reason underlying these results might be the following. Movies with a limited number of ratings (unpopular movies) are generally watched by fans of their genre. For example, a fan of sci-fi movies may also watch an unpopular sci-fi movie. So, unpopular movies generally get ratings from the same set of users, who are fans of that movie's genre, and this makes the rating vectors of movies of the same genre similar to each other. On the other hand, if a movie is popular, it gets ratings from a diverse set of users, which causes its rating neighborhood to diverge from its genre.

5. CONCLUSION
We should note at the outset that the conclusions presented here are not definitive. Different experiments on different datasets and with different item types need to be done in order to draw firmer conclusions. However, we hope that these experiments and results will stimulate discussion and further research.

In this work we examined the relationship between rating similarity and content similarity of movies in the Movielens dataset. We examined two kinds of content: the tf-idf weights of movie articles in Wikipedia and the genre keywords of movies provided by the Movielens dataset.

We found that, to a certain degree, there is a correspondence between rating similarity and Wikipedia content similarity, and also between rating similarity and genre similarity. However, we leave the magnitude of these correspondences open to discussion. We also found that as the number of ratings of a movie increases, its rating similarity approaches its Wikipedia content similarity, whereas its rating similarity diverges from its genre similarity.

According to these results, if diversification is done based on the rating patterns of movies, then the recommendation lists will likely be diversified with respect to the content of movies to some extent. So, if no content information is available or it is difficult to obtain, it might be useful to use rating patterns to diversify the recommendation lists.

To this analysis we plan to add latent characteristics of items generated by matrix factorization methods [4]. We plan to explore the correspondences among similarities defined over rating patterns, contents, and latent characteristics of items.

6. REFERENCES
[1] R. Boim, T. Milo, and S. Novgorodov. Diversification and refinement in collaborative filtering recommender. In CIKM, pages 739–744, 2011.
[2] N. Hurley and M. Zhang. Novelty and diversity in top-N recommendation - analysis and evaluation. ACM Transactions on Internet Technology, 10(4):14, 2011.
[3] G. Katz, N. Ofek, B. Shapira, L. Rokach, and G. Shani. Using Wikipedia to boost collaborative filtering techniques. In RecSys, pages 285–288, 2011.
[4] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[5] A. Loizou and S. Dasmahapatra. Using Wikipedia to alleviate data sparsity issues in recommender systems, pages 104–111. IEEE, 2010.
[6] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In AAAI/IAAI, pages 187–192, 2002.
[7] J. B. Schafer, D. Frankowski, J. L. Herlocker, and S. Sen. Collaborative filtering recommender systems. In The Adaptive Web, pages 291–324, 2007.
[8] M. Zhang and N. Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys, pages 123–130, 2008.