Evaluation of Collaborative Filtering Algorithms for Recommending Articles on CiteULike Denis Parra Peter Brusilovsky School of Information Sciences School of Information Sciences University of Pittsburgh University of Pittsburgh 135 North Bellefield Avenue 135 North Bellefield Avenue Pittsburgh, PA 15260 Pittsburgh, PA 15260 dap89@pitt.edu peterb@ pitt.edu ABSTRACT different recommendation approaches. First of all, user- Motivated by the potential use of collaborative tagging systems to contributed content is more diverse in its nature and quality than develop new recommender systems, we have implemented and centrally created and structured content of traditional repositories. compared three variants of user-based collaborative filtering Second, traditional 5-10 point ratings are typically not available – algorithms to provide recommendations of articles on CiteULike. only the fact that an item was contributed or bookmarked by the On our first approach, Classic Collaborative filtering (CCF), we user is present in the system. At the same time, the loss of quality use Pearson correlation to calculate similarity between users and a control and fine-grained ratings in collaborative tagging systems classic adjusted ratings formula to rank the recommendations. Our is compensated by the presence of tags and (in most systems) second approach, Neighbor-weighted Collaborative Filtering explicit connections between users. It looks evident that (NwCF), incorporates the amount of raters in the ranking formula recommendation approaches for collaborative tagging systems of the recommendations. A modified version of the Okapi BM25 should capitalize on the success of classic recommender system, IR model over users’ tags is implemented on our third approach to while trying to harness the new power provided by tags and social form the user neighborhood. Our results suggest that links. However, there is no shared understanding of how these incorporating the number of raters into the algorithms leads to an features have to be taken into account to improve the quality of improvement of precision, and they also support that tags can be personalization. A few pioneer projects explored different ways to considered as an alternative to Pearson correlation to calculate the integrate social links or social tags into collaborative similarity between users and their neighbors in a collaborative recommendation [3, 4, 6], and content-based recommendation [5] tagging system. approaches. To some extent, the results are encouraging -- both social links and tags do indeed improve the personalization quality. At the same time, the overall recommendation quality is Categories and Subject Descriptors unusually low – the precision for both content based and H.3.3 [Information Storage and Retrieval]: Information Search collaborative “tag-aware” recommendation reported in [4, 6] stays and Retrieval–information filtering; Information Search and in the range of 0.1-0.3. The lack of reliable success calls for Retrieval–selection process. further research on recommendation in social tagging systems. This paper contributes to this stream of research by exploring two General Terms extensions of the traditional collaborative filtering approaches. Algorithms, Measurement, Performance, Experimentation. First, we argue that the diverse user-contributed nature of content in collaborative tagging systems requires more evidence of relevance and quality than in traditional systems where the Keywords content is co-rated by the site developers. In this context, Collaborative-filtering, recommender systems, tagging. recommender algorithms should favor items bookmarked by more users. However, classic algorithms do not take the number of 1. INTRODUCTION raters into account. Second, we argue that due to the large volume The new generation of collaborative tagging systems such as of items and low overlap between user bookmarks traditional Delicious or CiteULike presented a new challenge to researchers approach for neighborhood calculation may be not most efficient. and practitioners in the area of recommender systems. While both Two users who are very similar in their interests may still have too content-based [1] and collaborative filtering recommender few common items bookmarked. In this context, tags applied by systems [2] achieved a remarkable success in traditional users can provide a more reliable approach to find similar users information repositories, social tagging systems may need some and this to get better recommendation. To assess our hypotheses we developed variants of user-based collaborative filtering, which Permission to make digital or hard copies of all or part of this work for take into account the number of users who bookmarked an item personal or classroom use is granted without fee provided that copies are and one approach use tags-level similarity instead of traditional not made or distributed for profit or commercial advantage and that Pearson correlation to form user neighborhood. copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, The rest of the paper is addressed as follows. Section 2 describes requires prior specific permission and/or a fee. the characteristics of the data and how it was collected. Section 3 Web 3.0: Merging Semantic Web and Social Web at describes the three recommender approaches developed: Classic HyperText ’09 June 29th, 2009, Torino, Italy. Collaborative Filtering (CCF), Neighbor-weighted Collaborative Filtering (NwCF) and BM25-based similarity (BM25). In Section ∑ i ⊂CRu , n (rui − ru )(rni − rn ) (1) 4 we describe the study conducted and present the results. Section userSim(u , n) = 5 introduces relevant related work, in Section 6 we address the ∑ i ⊂CRu , n (rui − ru ) 2 ∑ i ⊂CRu , n (rni − rn ) 2 discussion and in Section 7 we summarize conclusions and future work. In the formula, r stands for rating, u denotes the center user and n a neighbor. CR u,n denotes the set of co-rated items between u and n. After performing this calculation, we select the top ten most 2. DATASET similar users. Next, we rank the articles of these users to We performed our study based on data that we crawled from recommend to the center user, using the formula of predicted CiteULike 1. The daily datasets provided by CiteULike lack a lot rating for user u with average adjusts described in [2] of relevant information necessary to develop our algorithms, as the title and the authors of each article. ∑ n ⊂ neighbors ( u ) userSim(u , n) ⋅ (rni − rn ) pred (u , i ) = ru + (2) We selected a group of users to be our center users, i.e., those who would receive the recommendations. For each one of these ∑ n ⊂ neighbors ( u ) userSim(u , n) center users, we crawled her posted articles (id, title, authors, post timestamp, and tags associated), the neighborhood of users who 3.2 Neighbor-weighted Collaborative posted her same articles, and the neighborhood of users who share Filtering (NwCF) her same tags. To avoid limiting the neighborhood due to tag This method is an enhancement of our CCF implementation. The variations as hyphens, underscores and plurals, we enhanced the neighborhood of ten users is obtained in exactly the same way, spreading of tags by adding stemmed tags using Krovetz using the Pearson correlation. However, we have incorporated the algorithm, and modified tags changing hyphens and underscores number of raters in the calculation of the ranking of the articles. to eventually be added to the set of tags to be crawled. We do it due to a large amount of the articles have been rated by only one or at most two users. In this way, we push up in the The details of the final dataset are described in Table 1. We chose recommendation list those articles rated by a larger number of 7 center users and we crawled all their articles and tags. We chose neighbors. The new predicted rating is given by 100 neighbors for each center user, selecting those neighbors with more shared tags in amount and frequency. There was an overlap pred ′(u , i ) = log10 (1 + nbr (i )) ⋅ pred (u , i ) (3) between these neighbors, so we finally crawled 358 users, including center users and neighbors. For each of these neighbors we also crawled all their articles and tags. In Table 1, annotations 3.3 BM25-based Similarity (BM25) correspond to tuples of the style {user, article, tag} BM25, also known as Okapi BM25, is a non-binary probabilistic Table 1. Description of the dataset model used in information retrieval [7]. It calculates the relevance that the documents of one collection have, given a query. As we Item # of unique instances try to take advantage of the set of tags of each user, we made two users 358 analogies: comparing the tags of the center user with a query, and articles 186,122 the set of tags of each neighbor with a document. Based on this tags 51,903 idea, we performed a similarity calculation based on the BM25 model and thus we obtained her neighborhood. Our proposed annotations 902,711 BM25-based similarity model is taken from the calculation of the Retrieval Status Value of a document (RSV d ) of a collection 3. ALGORITHMS given a query [7]: To create user-based recommendations using collaborative (k1 + 1)tf td (k 3 + 1)tf tq (4) filtering, two processes are necessary. The first one is finding the RSVd = ∑ IDF ⋅ ⋅ neighborhood of the center user, i.e., her most similar users. Once t∈q k1 ((1 − b) + b × ( Ld / Lave )) + tf td k 3 + tf tq the most similar users are identified, the second process is to rank the articles to be recommended. These articles will be taken from In our model RSV d represents the similarity score between the the set of articles which the neighbors have rated as their center user (the terms of the query q) and one neighbor (the terms favorites, yet discounting those articles that the center user already of the document d). This similarity is calculated as a sum over has posted. every tag t posted by the center user. The neighbor d is represented as her set of tags with their respective frequencies. L d We implemented three user-based collaborative filtering is the document length, in our case is the sum of the frequencies approaches: Classic Collaborative Filtering (CCF), Neighbor- of each tag of the neighbor d. L ave is the average of the L d of weighted Collaborative Filtering (NwCF) and BM25-based every neighbor. The term tf td is the frequency of the tag t into the similarity (BM25). set of tags of the neighbor d. tf tq represents the frequency of the 3.1 Classic Collaborative Filtering (CCF) tag t into the query, i.e., the set of tags of the center user. Finally, k1, k3 and b are parameters that we have been set in 1.2, 1.2 and This approach is described in detail in [2]. In the CCF model, the 0.8 respectively, values slightly different from those suggested by similarity between two users is calculated using the Pearson default in [7]. correlation over the ratings of their common items. The formula for the Pearson correlation, as stated in [2], is: After calculating the similarity between the center user and each neighbor, we choose the top N similar neighbors, and then we calculate the ranking of the recommended articles using the 1 www.citeulike.org formula (3). 4. THE STUDY articles evaluated as Relevant. Besides, we calculated the average Novelty for each user on each method. 4.1 Experiments To perform our study, we selected seven active CiteULike users Figure 1 (a) shows us smooth results on different subjects and not which had posted at least 50 articles each. Four of the subjects are so different results on the values of nDCG between different part of the Personalized Adaptive Web Systems (PAWS) lab of algorithms. However, if we compare them further, we can see that the School of Information Sciences at the University of CCF performed the worst and is not so clear which one, Pittsburgh. Three additional subjects were selected randomly from BM25_10, BM25_20 or NwCF are significantly the best. This a list of active CiteULike users. result suggests us that the ranking order of the recommendations, in general, is very close to the optimal one, where the most For each subject we generated 4 sets with 10 ranked articles each relevant articles are at the top and the less ones at the bottom. On one. The first three lists were generated using the methods CCF, the other hand, CCF shows in general a better level of novelty. NwCF and BM25, considering 10 neighbors for each center user. The fourth list was generated using BM25, yet considering 20 The results on Precision_2 and Precision_2_1 do not let us infer neighbors. To avoid pitfalls in the evaluation [8], for each subject easily some ideas, but we can see some trends. In general, CCF we combined the 4 sets of recommendations into one set, we has the worst results, suggesting that including the amount of changed the order of the articles randomly and we ask them to raters in the ranking formula is an important factor to consider in evaluate each article relevancy (relevant, somewhat relevant, and the success of these recommendations. In addition, the dissimilar not relevant), and novelty (novel, somewhat novel, and not novel) results of BM25 using 10 and 20 neighbors, suggests that we using a 3-point scale. For example, one article can be evaluated as should have taken a threshold to select the size of the relevant but not novel (because it was already known), and neighborhood instead of choosing a fixed number such as 10 or another article can be judged to be relevant and also novel, 20. For example, CiteULike shows a neighborhood for each user, because the user just discovered and found it to be important to including just those who share at least the median number of her interests. articles of the center user. Another aspect considered to control the evaluation was to provide the URL on CiteULike of each article. We requested each 5. RELATED WORK subject to evaluate the articles based on that information or A few pioneer projects explored different ways to integrate social looking for the abstract on the internet, but don’t going further links or social tags. In [3], the authors incorporate social tags and than the abstract. also the concept of web of trust for the issue of quality assessment into a collaborative recommendation approach. The study in [4] investigates the effect of incorporating tags to different CF 4.2 Results algorithms, testing their algorithms on last.fm, a musical social For each subject, we calculated normalized Discounted tagging system, obtaining promising results. The approach Cumulative Gain (nDCG) [7], Precision_2 @ 5, Precision_2 @ presented in [5] compared a pure content-based with a tag- 10, Precision_2_1 @ 5 and Precision_2_1 @ 10 over the different enhanced recommender, showing an improvement in predicted initial four lists of recommendations. In Precision_2_1, we accuracy in the context of cultural heritage personalization. consider relevant those articles evaluated as Relevant and Somewhat Relevant. In Precision_2, we only consider relevant the 1.2 2 1.2 1.8 1 1 1.6 1.4 Precision_2 @ 5 0.8 0.8 Average Novelty CCF 1.2 CCF CCF nDCG 0.6 NwCF 1 NwCF 0.6 NwCF BM25_10 0.8 BM25_10 BM25_10 0.4 0.4 BM25_20 0.6 BM25_20 BM25_20 0.4 0.2 0.2 0.2 0 0 0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_1 user_2 user_3 user_4 user_5 user_6 user_7 (a) nDCG (b) Average Novelty score (c) Precision_2 @ 5 1.2 1.2 1.2 1 1 1 Precision_2_1 @ 10 Precision_2_1 @ 5 0.8 Precision_2 @ 10 0.8 0.8 CCF CCF CCF 0.6 NwCF 0.6 NwCF 0.6 NwCF BM25_10 BM25_10 BM25_10 0.4 0.4 0.4 BM25_20 BM25_20 BM25_20 0.2 0.2 0.2 0 0 0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_1 user_2 user_3 user_4 user_5 user_6 user_7 (d) Precision_2 @ 10 (e) Precision_2_1 @ 5 (f) Precision2_1 @ 10 Figure 1: Metrics showing the results of each user on each method of the experiment (a) nDCG, (b) Average Novelty, (c) Precision_2 @ 5, (d) Precision_2 @ 10, (e) Precision_2_1 @ 5, (f) Precision_2_1 @ 10 The study presented in [6] describes the use of CiteULike for uncertainty produced by items with too few ratings. Third, a tag- recommending scientific articles to users. They compared three based approach to obtain the neighborhood of a user on social different collaborative filtering algorithms, two item-based and tagging systems can be a suitable alternative to classical Pearson one user-based, and they found that the user-based performed the correlation. Our survey to seven users was a preliminary study and best. They evaluated their algorithms using accuracy metrics as on eventual investigations we will consider more subjects to MAP, MMR and Precision @ 10, with low accuracy levels, in the support our findings. range 0.1-0.3. For our future research, we have already discussed two ideas. In [8] McNee et al. developed three algorithms to recommend Firstly, we want to incorporate tags on the ranking model. On this articles to users, and they assessed them with a detailed survey on study we used tags only to obtain the neighborhood, i.e., to real users. In some algorithms, the subjects provided strong perform the user-similarity calculations. We believe extending the negative results, and the authors describe in their conclusion that use of tags can improve the results of precision of our BM25 when evaluating a recommender system “the evaluation must be approach. Secondly, we will cluster the users’ tags. Users can done with real users, as current accuracy metrics cannot detect have more than one interest of research, which is easy to observe these problems”. Based on this study we decided to ask the while examining their tags. We will implement clustering subjects to evaluate the novelty in addition to the relevance of the algorithms to identify the different interests of the users and we recommended articles. Four of our seven subjects commented at expect to provide more topic-oriented recommendations. the end of the survey that they found very interesting articles in their recommendation list. 8. ACKNOWLEDGMENTS This material is based upon work supported by the National 6. DISCUSSION Science Foundation under Grant No. 0840597. During the development of our approaches, we were stack for a while on CCF and NwCF for the low quality of the preliminary recommendations. We were using the ratings given by the users to 9. REFERENCES obtain their neighborhood, which are given by 5-star scale and an [1] Pazzani, M. and Billsus, D. 2007 Content-Based “I've already read it” description. Since many users post articles Recommendation Systems. The Adaptive Web. (May 2007), without taking care of the ratings (by default it is 2 stars), and 325-341. their evaluation criteria can vary a lot among different users, we [2] Schafer, J., Frankowski, D., Herlocker, J. and Sen, S. 2007 decided to change the scale for a 3-point one. Afterwards, the Collaborative Filtering Recommender Systems. The results showed a significant improvement. We suggest paying Adaptive Web. (May 2007), 291-324. attention to the rating scale used in recommender algorithms for [3] Massa, P. and Avesani, P. 2004 Trust-Aware Collaborative social bookmarking systems, in order to diminish the impact of Filtering for Recommender Systems. In Proceedings of OTM noise and users’ criteria. Confederated International Conferences, CoopIS, DOA, and ODBASE (Agia Napa, Cyprus, Oct. 25-29, 2004). 492-508. We consider that the inclusion of the amount of raters in the [4] Tso-Sutter, K. H., Marinho, L. B., and Schmidt-Thieme, L. ranking formula is an important contribution. The Figure 1 shows 2008. Tag-aware recommender systems by fusion of clearly that both nDCG and Precision metrics had better results collaborative filtering algorithms. In Proceedings of the 2008 for NwCF than for CCF. This result supports our claim that the ACM Symposium on Applied Computing (Fortaleza, Brazil, “social knowledge” provided by the amount of raters helps to March 16 - 20, 2008). SAC '08. ACM, New York, NY, decrease the uncertainty implicit on items with too few ratings. 1995-1999. However, this approach should be considered carefully depending [5] de Gemmis, M., Lops, P., Semeraro, G., and Basile, P. 2008. on the user information need. CCF shows, in general, the best Integrating tags in a semantic content-based recommender. In novelty values among the subjects, but this idea should be tested Proceedings of the 2008 ACM Conference on Recommender with more users to be claimed as true. Systems (Lausanne, Switzerland, October 23 - 25, 2008). Regarding BM25-based similarity, in most cases it performs better RecSys '08. ACM, New York, NY, 163-170. than CCF, but with no predictable results between using 10 or 20 [6] Bogers, T. and van den Bosch, A. 2008. Recommending neighbors, which implies that a threshold based on each user scientific articles using citeulike. In Proceedings of the 2008 characteristics should result better than a fixed number of ACM Conference on Recommender Systems (Lausanne, neighbors. Switzerland, October 23 - 25, 2008). RecSys '08. ACM, New York, NY, 287-290. [7] Manning, C., Raghavan, P. and Schutze, H. 2008 7. CONCLUSIONS AND FUTURE WORK Introduction to Information Retrieval. Cambridge University In this study, we implemented three variations of user-based Press. collaborative filtering algorithms on the popular social [8] McNee, S. M., Kapoor, N., and Konstan, J. A. 2006. Don't collaborative tagging service for scientific articles, CiteULike. We look stupid: avoiding pitfalls when recommending research can summarize the results of our study in three main findings. papers. In Proceedings of the 20th Anniversary Conference First, classical rating-based collaborative filtering algorithms on Computer Supported Cooperative Work (Banff, Alberta, implemented on social tagging systems must analyze carefully the Canada, November 04 - 08, 2006). CSCW '06. ACM, New rating scale to avoid noise on the recommendation lists. Second, York, NY, 171-180. incorporating the amount of raters on the ranking formula of classical recommender algorithms can help to decrease the