How Similar is Rating Similarity to Content Similarity?

Osman Başkaya
Department of Computer Engineering, Bahçeşehir University, İstanbul, Turkey
osman.baskaya@computer.org

Tevfik Aytekin
Department of Computer Engineering, Bahçeşehir University, İstanbul, Turkey
tevfik.aytekin@bahcesehir.edu.tr

ABSTRACT
The success of a recommendation algorithm is typically measured by its ability to predict the rating values of items. Although accuracy in rating prediction is an important property of a recommendation algorithm, there are other properties that matter for user satisfaction. One such property is the diversity of recommendations: it has been recognized that being able to recommend a diverse set of items plays an important role in user satisfaction. One convenient approach to diversification is to use the rating patterns of items. However, in what sense the resulting lists will be diversified is not clear. In order to assess this, we explore the relationship between the rating similarity and the content similarity of items. We discuss the experimental results and the possible implications of our findings.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation, Measurement

Keywords
diversity, recommender systems, collaborative filtering

1. INTRODUCTION
Recommender systems help users pick items of interest based on explicit or implicit information that users provide to the system. One of the most successful and widely used techniques in recommender systems is collaborative filtering (CF) [7]. CF algorithms try to predict the ratings of a user based on the ratings of that user and the ratings of other users in the system. The performance of collaborative filtering algorithms is typically measured by the error they make in predicting the ratings of users for items. Although prediction accuracy is an important aspect of recommender systems, it is not the only one. Recently, increasing the diversity of recommendation lists has gained attention among researchers in the field [8, 2]. Being able to recommend a diverse set of items to a user is important for user satisfaction, because a recommendation list consisting of one type of item (e.g., movies only from the same genre) might not be very satisfactory even if the accuracy of rating prediction is high. But this raises an issue: we first need to define a metric for measuring the diversity of a recommendation list, and only then can we try to optimize the recommendation list with respect to that metric. One possible metric for measuring the diversity of the recommendation list of a particular user is described in [2]. This metric measures diversity as the average dissimilarity of all pairs of items in a user's recommendation list. Formally, it can be defined as follows:

    D(R) = \frac{1}{N(N-1)} \sum_{i \in R} \sum_{j \in R,\, j \neq i} d(i, j),        (1)

where R is the recommendation list of a user, N = |R|, and d(i, j) is the dissimilarity of items i and j, defined as one minus the similarity of items i and j.

We think that average dissimilarity is a reasonable way to measure the diversity of a list of items. The important question, however, is how to define d(i, j), i.e., the dissimilarity of two items, which is left unspecified in equation (1). The problem is not to choose a similarity metric such as Pearson or cosine. The problem is whether we can use the rating patterns (vectors) of items in order to measure their similarity, and, if we do use these rating patterns, in what respect the recommendation lists will be diversified. For example, in a movie recommender system, will the recommendation lists contain more movies from different genres, or will the content of the movies get diversified?

In order to answer these questions we compare rating similarity with two types of content similarity, which we define below. We hope that the results we discuss will shed some light on these questions and stimulate discussion on diversification.
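As an illustration of equation (1), the sketch below computes the intra-list diversity of a recommendation list from item feature vectors. This is not code from the paper; representing each item as a dense vector and taking d(i, j) = 1 - cosine similarity are our own assumptions for the example.

import numpy as np

def intra_list_diversity(item_vectors):
    """Average pairwise dissimilarity D(R) of a recommendation list,
    as in equation (1), with d(i, j) = 1 - cosine_similarity(i, j).

    item_vectors: array of shape (N, F), one row per recommended item
    (e.g., an item's rating vector or tf-idf vector)."""
    n = len(item_vectors)
    if n < 2:
        return 0.0
    # Cosine similarity between every pair of item vectors.
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    unit = item_vectors / np.clip(norms, 1e-12, None)
    dissim = 1.0 - unit @ unit.T
    # Sum over all ordered pairs with i != j, then normalize by N(N-1).
    np.fill_diagonal(dissim, 0.0)
    return dissim.sum() / (n * (n - 1))

# Usage with toy feature vectors for a 3-item list.
R = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 0.0, 1.0]])
print(intra_list_diversity(R))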
2. RELATED WORKS
In hybrid recommender systems, content information is used in order to increase the accuracy of rating predictions, especially for items whose ratings are too sparse. For example, [3, 5, 6] use content information collected from sources such as Wikipedia and IMDB in order to improve the accuracy of rating predictions. These works indirectly show that there is indeed some positive relationship between rating similarity and content similarity; otherwise, it would not be possible to increase the prediction accuracy using content information.

Another paper which comes close to our concerns is [1]. Here, the authors propose a new algorithm for diversifying recommendation lists. Their algorithm uses the rating patterns of movies for diversification, and the results are evaluated by looking at how well the recommendation lists are diversified with respect to the genres and movie series to which the movies belong. They report that the diversity of the resulting lists increases in both respects (genre and series). However, to the best of our knowledge there are no direct comparisons between rating and content similarity. In this paper we examine these two types of similarity directly.

3. ITEM CONTENT GENERATION
In our experiments we use the Movielens (1M) dataset (http://www.grouplens.org/node/73). In order to compare movies' rating patterns to their contents we first need to generate movie content information. We use two sources of information to this end. One source of content information comes from the Wikipedia articles corresponding to the movies in the Movielens dataset. The other source comes from the genre information provided in the dataset. The details of content generation are given below.

3.1 Content Generation from Wikipedia
The Movielens dataset contains 3883 distinct movies and 6040 users. Some of these movies are not rated by any user, and some have no corresponding entries in Wikipedia. After discarding these movies we are able to fetch 3417 movie articles (approximately 88% of all movies) from Wikipedia.

In this work we only use the text of each Wikipedia article (we do not use the link structure or category information of the articles). The text of a Wikipedia article consists of parts such as "Plot", "Cast", and "Release". We do not include the "References" and "See also" parts, since they may contain information that is unrelated to the content of the movies. After extracting the text of each document we apply some basic preprocessing steps such as stemming and stop-word removal. We use a vector space model to represent the text documents.
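The paper does not include code for this preprocessing step; the following sketch shows one plausible way to build the Movie-TFIDF representation described above, assuming the article texts (with the "References" and "See also" sections already removed) are available in a Python dict. The use of NLTK's Porter stemmer and scikit-learn's TfidfVectorizer is our choice, not the authors'.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(article_text):
    """Lowercase, drop English stop words, and stem the remaining tokens."""
    tokens = article_text.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

def build_movie_tfidf(articles):
    """articles: hypothetical dict mapping movie id -> cleaned article text.
    Returns the ordered movie ids and the Movie-TFIDF matrix (rows = movies)."""
    movie_ids = sorted(articles)
    corpus = [preprocess(articles[m]) for m in movie_ids]
    movie_tfidf = TfidfVectorizer().fit_transform(corpus)
    return movie_ids, movie_tfidf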
3.2 Genre Information
As a second source of content we use the genre keywords (such as adventure, action, comedy, etc.) provided by the Movielens dataset. Each movie in the dataset is associated with one or more genre keywords. We define the genre similarity between two movies using the Jaccard metric:

    J(i, j) = \frac{|G_i \cap G_j|}{|G_i \cup G_j|},        (2)

where G_i and G_j are the genre sets of items i and j.
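Equation (2) translates directly into code. In the sketch below the genre sets are assumed to be Python sets of genre keywords; the example genres are only illustrative.

def genre_similarity(genres_i, genres_j):
    """Jaccard similarity J(i, j) between two movies' genre sets (equation 2)."""
    if not genres_i and not genres_j:
        return 0.0
    return len(genres_i & genres_j) / len(genres_i | genres_j)

# Example with illustrative genre sets: shared {"Action"}, union of size 3.
print(genre_similarity({"Action", "Sci-Fi"}, {"Action", "Thriller"}))  # 1/3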
4. EXPERIMENTS
In the first set of experiments we try to understand the relation between movie rating patterns and the content generated from the corresponding Wikipedia articles. We have two matrices: the Movie-User matrix, which holds the ratings of users on movies, and the Movie-TFIDF matrix, which holds the tf-idf weights of each document. For evaluation we use the following methodology. For each movie we find the 100 most similar movies using the Movie-User matrix (the rating neighborhood) and the 100 most similar movies using the Movie-TFIDF matrix (the content neighborhood). We then count the number of common items in these two neighborhoods. It turns out that on average there are 14.74 common movies in the two neighborhoods. If we generate the neighborhoods randomly, this value is around 2.80. Randomization tests show that this difference is significant (p < 0.01).

We run the same experiment with different neighborhood sizes (20 and 50), but the percentages of common items in the rating and content neighborhoods turn out to be similar to those obtained with a neighborhood of size 100.
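A minimal sketch of this neighborhood-overlap methodology is given below. The cosine similarity used for both matrices, the dense matrix layout (one row per movie), and the simple random baseline are our assumptions; the sketch illustrates the procedure rather than reproducing the reported numbers.

import numpy as np

def top_k_neighbors(matrix, k=100):
    """For each row (movie), return the indices of the k most cosine-similar
    other rows. `matrix` has one row per movie (e.g., the Movie-User rating
    matrix or the Movie-TFIDF matrix as a dense array)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # a movie is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]  # top-k indices per movie

def average_overlap(movie_user, movie_tfidf, k=100):
    """Average number of common movies between each movie's rating
    neighborhood and its content neighborhood."""
    rating_nbrs = top_k_neighbors(movie_user, k)
    content_nbrs = top_k_neighbors(movie_tfidf, k)
    overlaps = [len(set(r) & set(c)) for r, c in zip(rating_nbrs, content_nbrs)]
    return float(np.mean(overlaps))

def random_baseline(n_movies, k=100, trials=100, seed=0):
    """Expected overlap when the two k-item neighborhoods are drawn at random."""
    rng = np.random.default_rng(seed)
    overlaps = []
    for _ in range(trials):
        a = rng.choice(n_movies, size=k, replace=False)
        b = rng.choice(n_movies, size=k, replace=False)
        overlaps.append(len(set(a) & set(b)))
    return float(np.mean(overlaps))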
We also test whether there is a relationship between the number of ratings a movie has and the correspondence between rating and content similarity. To see this we find the rating and content neighborhoods of movies that have a similar number of ratings. To do this we divide the movies into rating intervals according to the number of ratings they have: movies with between 1 and 100 ratings, between 101 and 200 ratings, and so on. If an interval has fewer than 20 movies, we merge it with the previous one in order to increase the significance of the results. Figure 1 shows the average number of common items in the rating and content neighborhood sets of movies as a function of rating intervals. Interestingly, Figure 1 shows a clear linear correlation: as the number of ratings increases, the number of common items in the content and rating neighborhoods of movies also increases.

[Figure 1: Average number of common movies as a function of rating intervals.]

One possible explanation of this positive linear correlation is the following. Generally, there is a positive relationship between the number of ratings and the popularity of a movie, which means that popular movies receive ratings from many different people with different tastes. Hence the rating patterns of popular movies reflect a diverse set of characteristics. Wikipedia movie articles also have rich contents reflecting different characteristics of movies. This might explain why a movie's rating neighborhood approaches its content neighborhood as the number of ratings increases.

In the next set of experiments our aim is to understand the relationship between movie rating patterns and the movie genres provided in the Movielens dataset. Genre keywords provide limited information compared to Wikipedia articles, because Wikipedia articles contain terms that give information not only about the genre of a movie but also about its director, cast, musical composition, etc.

In order to measure the relationship between movie rating patterns and genres we apply a similar methodology. For each movie m we find the 100 most similar movies using the Movie-User matrix (that is, the rating neighborhood) and compute the Jaccard similarity (as defined in equation 2) between movie m and the movies in its rating neighborhood. The average Jaccard similarity value turns out to be 0.43. If we generate the rating neighborhood randomly, we find a Jaccard value of around 0.17. Randomization tests show that this difference is significant (p < 0.01).

We also test whether there is a relationship between the number of ratings and genre similarity. Similar to the experiment described above, we divide the movies into rating intervals according to the number of ratings they have. Then, for each movie m in a rating interval, we calculate the Jaccard similarity between movie m and its rating neighborhood of 100 movies and compute the averages per rating interval. Figure 2 shows these average values as a function of rating intervals. Here we again have an interesting case: there is a negative linear correlation, which means that the more ratings a movie has, the more its rating similarity diverges from its genre similarity.

[Figure 2: Average Jaccard index as a function of rating intervals.]

The reason underlying these results might be the following. Movies with a limited number of ratings (unpopular movies) are generally watched by fans of their genre. For example, a fan of sci-fi movies may also watch an unpopular sci-fi movie. So, unpopular movies generally get ratings from the same set of users, who are fans of that movie's genre, and this makes the rating vectors of movies of the same genre similar to each other. On the other hand, if a movie is popular, it gets ratings from a diverse set of users, which causes its rating neighborhood to diverge from its genre.

5. CONCLUSION
We should note at the outset that the conclusions presented here are not definitive. Different experiments on different datasets and with different item types need to be done in order to draw firmer conclusions. However, we hope that these experiments and results will stimulate discussion and further research.

In this work we examined the relationship between rating similarity and content similarity of movies in the Movielens dataset. We examined two kinds of content: the tf-idf weights of movie articles in Wikipedia and the genre keywords of movies provided by the Movielens dataset.

We found that, to a certain degree, there is a correspondence between rating similarity and Wikipedia content similarity, and also between rating similarity and genre similarity. However, we leave the magnitude of these correspondences open to discussion. We also found that as the number of ratings of a movie increases, its rating similarity approaches its Wikipedia content similarity, whereas its rating similarity diverges from its genre similarity.

According to these results, if diversification is done based on the rating patterns of movies, then the recommendation lists will likely be diversified with respect to the content of movies to some extent. So, if no content information is available or it is difficult to obtain, it might be useful to use rating patterns to diversify the recommendation lists.

To this analysis we plan to add latent characteristics of items generated by matrix factorization methods [4]. We plan to explore the correspondences among similarities defined over rating patterns, contents, and latent characteristics of items.

6. REFERENCES
[1] R. Boim, T. Milo, and S. Novgorodov. Diversification and refinement in collaborative filtering recommender. In CIKM, pages 739–744, 2011.
[2] N. Hurley and M. Zhang. Novelty and diversity in top-N recommendation - analysis and evaluation. ACM Transactions on Internet Technology, 10(4):14, 2011.
[3] G. Katz, N. Ofek, B. Shapira, L. Rokach, and G. Shani. Using Wikipedia to boost collaborative filtering techniques. In RecSys, pages 285–288, 2011.
[4] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[5] A. Loizou and S. Dasmahapatra. Using Wikipedia to alleviate data sparsity issues in recommender systems, pages 104–111. IEEE, 2010.
[6] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In AAAI/IAAI, pages 187–192, 2002.
[7] J. B. Schafer, D. Frankowski, J. L. Herlocker, and S. Sen. Collaborative filtering recommender systems. In The Adaptive Web, pages 291–324, 2007.
[8] M. Zhang and N. Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys, pages 123–130, 2008.