Evaluation of Collaborative Filtering Algorithms for
                Recommending Articles on CiteULike
                             Denis Parra                                                       Peter Brusilovsky
                  School of Information Sciences                                         School of Information Sciences
                     University of Pittsburgh                                               University of Pittsburgh
                   135 North Bellefield Avenue                                            135 North Bellefield Avenue
                     Pittsburgh, PA 15260                                                   Pittsburgh, PA 15260
                          dap89@pitt.edu                                                        peterb@ pitt.edu

ABSTRACT                                                                    different recommendation approaches. First of all, user-
Motivated by the potential use of collaborative tagging systems to          contributed content is more diverse in its nature and quality than
develop new recommender systems, we have implemented and                    centrally created and structured content of traditional repositories.
compared three variants of user-based collaborative filtering               Second, traditional 5-10 point ratings are typically not available –
algorithms to provide recommendations of articles on CiteULike.             only the fact that an item was contributed or bookmarked by the
On our first approach, Classic Collaborative filtering (CCF), we            user is present in the system. At the same time, the loss of quality
use Pearson correlation to calculate similarity between users and a         control and fine-grained ratings in collaborative tagging systems
classic adjusted ratings formula to rank the recommendations. Our           is compensated by the presence of tags and (in most systems)
second approach, Neighbor-weighted Collaborative Filtering                  explicit connections between users. It looks evident that
(NwCF), incorporates the amount of raters in the ranking formula            recommendation approaches for collaborative tagging systems
of the recommendations. A modified version of the Okapi BM25                should capitalize on the success of classic recommender system,
IR model over users’ tags is implemented on our third approach to           while trying to harness the new power provided by tags and social
form the user neighborhood. Our results suggest that                        links. However, there is no shared understanding of how these
incorporating the number of raters into the algorithms leads to an          features have to be taken into account to improve the quality of
improvement of precision, and they also support that tags can be            personalization. A few pioneer projects explored different ways to
considered as an alternative to Pearson correlation to calculate the        integrate social links or social tags into collaborative
similarity between users and their neighbors in a collaborative             recommendation [3, 4, 6], and content-based recommendation [5]
tagging system.                                                             approaches. To some extent, the results are encouraging -- both
                                                                            social links and tags do indeed improve the personalization
                                                                            quality. At the same time, the overall recommendation quality is
Categories and Subject Descriptors                                          unusually low – the precision for both content based and
H.3.3 [Information Storage and Retrieval]: Information Search               collaborative “tag-aware” recommendation reported in [4, 6] stays
and Retrieval–information filtering; Information Search and                 in the range of 0.1-0.3. The lack of reliable success calls for
Retrieval–selection process.                                                further research on recommendation in social tagging systems.
                                                                            This paper contributes to this stream of research by exploring two
General Terms                                                               extensions of the traditional collaborative filtering approaches.
Algorithms, Measurement, Performance, Experimentation.                      First, we argue that the diverse user-contributed nature of content
                                                                            in collaborative tagging systems requires more evidence of
                                                                            relevance and quality than in traditional systems where the
Keywords                                                                    content is co-rated by the site developers. In this context,
Collaborative-filtering, recommender systems, tagging.                      recommender algorithms should favor items bookmarked by more
                                                                            users. However, classic algorithms do not take the number of
1. INTRODUCTION                                                             raters into account. Second, we argue that due to the large volume
The new generation of collaborative tagging systems such as                 of items and low overlap between user bookmarks traditional
Delicious or CiteULike presented a new challenge to researchers             approach for neighborhood calculation may be not most efficient.
and practitioners in the area of recommender systems. While both            Two users who are very similar in their interests may still have too
content-based [1] and collaborative filtering recommender                   few common items bookmarked. In this context, tags applied by
systems [2] achieved a remarkable success in traditional                    users can provide a more reliable approach to find similar users
information repositories, social tagging systems may need some              and this to get better recommendation. To assess our hypotheses
                                                                            we developed variants of user-based collaborative filtering, which
Permission to make digital or hard copies of all or part of this work for   take into account the number of users who bookmarked an item
personal or classroom use is granted without fee provided that copies are   and one approach use tags-level similarity instead of traditional
not made or distributed for profit or commercial advantage and that         Pearson correlation to form user neighborhood.
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,    The rest of the paper is addressed as follows. Section 2 describes
requires prior specific permission and/or a fee.                            the characteristics of the data and how it was collected. Section 3
Web 3.0: Merging Semantic Web and Social Web at                             describes the three recommender approaches developed: Classic
HyperText ’09 June 29th, 2009, Torino, Italy.                               Collaborative Filtering (CCF), Neighbor-weighted Collaborative
Filtering (NwCF) and BM25-based similarity (BM25). In Section                                         ∑      i ⊂CRu , n
                                                                                                                          (rui − ru )(rni − rn )
                                                                                                                                                                    (1)
4 we describe the study conducted and present the results. Section       userSim(u , n) =
5 introduces relevant related work, in Section 6 we address the                              ∑  i ⊂CRu , n
                                                                                                             (rui − ru )     2
                                                                                                                                 ∑   i ⊂CRu , n
                                                                                                                                                  (rni − rn )   2

discussion and in Section 7 we summarize conclusions and future
work.                                                                    In the formula, r stands for rating, u denotes the center user and n
                                                                         a neighbor. CR u,n denotes the set of co-rated items between u and
                                                                         n. After performing this calculation, we select the top ten most
2. DATASET                                                               similar users. Next, we rank the articles of these users to
We performed our study based on data that we crawled from
                                                                         recommend to the center user, using the formula of predicted
CiteULike 1. The daily datasets provided by CiteULike lack a lot
                                                                         rating for user u with average adjusts described in [2]
of relevant information necessary to develop our algorithms, as
the title and the authors of each article.                                                      ∑   n ⊂ neighbors ( u )
                                                                                                                            userSim(u , n) ⋅ (rni − rn )
                                                                         pred (u , i ) = ru +                                                                       (2)
We selected a group of users to be our center users, i.e., those
who would receive the recommendations. For each one of these
                                                                                                        ∑      n ⊂ neighbors ( u )
                                                                                                                                   userSim(u , n)
center users, we crawled her posted articles (id, title, authors, post
timestamp, and tags associated), the neighborhood of users who
                                                                         3.2 Neighbor-weighted Collaborative
posted her same articles, and the neighborhood of users who share        Filtering (NwCF)
her same tags. To avoid limiting the neighborhood due to tag             This method is an enhancement of our CCF implementation. The
variations as hyphens, underscores and plurals, we enhanced the          neighborhood of ten users is obtained in exactly the same way,
spreading of tags by adding stemmed tags using Krovetz                   using the Pearson correlation. However, we have incorporated the
algorithm, and modified tags changing hyphens and underscores            number of raters in the calculation of the ranking of the articles.
to eventually be added to the set of tags to be crawled.                 We do it due to a large amount of the articles have been rated by
                                                                         only one or at most two users. In this way, we push up in the
The details of the final dataset are described in Table 1. We chose      recommendation list those articles rated by a larger number of
7 center users and we crawled all their articles and tags. We chose      neighbors. The new predicted rating is given by
100 neighbors for each center user, selecting those neighbors with
more shared tags in amount and frequency. There was an overlap            pred ′(u , i ) = log10 (1 + nbr (i )) ⋅ pred (u , i )                                     (3)
between these neighbors, so we finally crawled 358 users,
including center users and neighbors. For each of these neighbors
we also crawled all their articles and tags. In Table 1, annotations     3.3 BM25-based Similarity (BM25)
correspond to tuples of the style {user, article, tag}                   BM25, also known as Okapi BM25, is a non-binary probabilistic
                Table 1. Description of the dataset                      model used in information retrieval [7]. It calculates the relevance
                                                                         that the documents of one collection have, given a query. As we
                     Item         # of unique instances                  try to take advantage of the set of tags of each user, we made two
                     users                 358                           analogies: comparing the tags of the center user with a query, and
                    articles             186,122                         the set of tags of each neighbor with a document. Based on this
                      tags               51,903                          idea, we performed a similarity calculation based on the BM25
                                                                         model and thus we obtained her neighborhood. Our proposed
                  annotations            902,711                         BM25-based similarity model is taken from the calculation of the
                                                                         Retrieval Status Value of a document (RSV d ) of a collection
3. ALGORITHMS                                                            given a query [7]:
To create user-based recommendations using collaborative
                                                                                                         (k1 + 1)tf td                (k 3 + 1)tf tq (4)
filtering, two processes are necessary. The first one is finding the     RSVd = ∑ IDF ⋅                                             ⋅
neighborhood of the center user, i.e., her most similar users. Once                t∈q      k1 ((1 − b) + b × ( Ld / Lave )) + tf td k 3 + tf tq
the most similar users are identified, the second process is to rank
the articles to be recommended. These articles will be taken from        In our model RSV d represents the similarity score between the
the set of articles which the neighbors have rated as their              center user (the terms of the query q) and one neighbor (the terms
favorites, yet discounting those articles that the center user already   of the document d). This similarity is calculated as a sum over
has posted.                                                              every tag t posted by the center user. The neighbor d is
                                                                         represented as her set of tags with their respective frequencies. L d
We implemented three user-based collaborative filtering                  is the document length, in our case is the sum of the frequencies
approaches: Classic Collaborative Filtering (CCF), Neighbor-             of each tag of the neighbor d. L ave is the average of the L d of
weighted Collaborative Filtering (NwCF) and BM25-based                   every neighbor. The term tf td is the frequency of the tag t into the
similarity (BM25).                                                       set of tags of the neighbor d. tf tq represents the frequency of the
3.1 Classic Collaborative Filtering (CCF)                                tag t into the query, i.e., the set of tags of the center user. Finally,
                                                                         k1, k3 and b are parameters that we have been set in 1.2, 1.2 and
This approach is described in detail in [2]. In the CCF model, the
                                                                         0.8 respectively, values slightly different from those suggested by
similarity between two users is calculated using the Pearson
                                                                         default in [7].
correlation over the ratings of their common items. The formula
for the Pearson correlation, as stated in [2], is:                       After calculating the similarity between the center user and each
                                                                         neighbor, we choose the top N similar neighbors, and then we
                                                                         calculate the ranking of the recommended articles using the
1
    www.citeulike.org                                                    formula (3).
4. THE STUDY                                                                                                                                     articles evaluated as Relevant. Besides, we calculated the average
                                                                                                                                                 Novelty for each user on each method.
4.1 Experiments
To perform our study, we selected seven active CiteULike users                                                                                   Figure 1 (a) shows us smooth results on different subjects and not
which had posted at least 50 articles each. Four of the subjects are                                                                             so different results on the values of nDCG between different
part of the Personalized Adaptive Web Systems (PAWS) lab of                                                                                      algorithms. However, if we compare them further, we can see that
the School of Information Sciences at the University of                                                                                          CCF performed the worst and is not so clear which one,
Pittsburgh. Three additional subjects were selected randomly from                                                                                BM25_10, BM25_20 or NwCF are significantly the best. This
a list of active CiteULike users.                                                                                                                result suggests us that the ranking order of the recommendations,
                                                                                                                                                 in general, is very close to the optimal one, where the most
For each subject we generated 4 sets with 10 ranked articles each                                                                                relevant articles are at the top and the less ones at the bottom. On
one. The first three lists were generated using the methods CCF,                                                                                 the other hand, CCF shows in general a better level of novelty.
NwCF and BM25, considering 10 neighbors for each center user.
The fourth list was generated using BM25, yet considering 20                                                                                     The results on Precision_2 and Precision_2_1 do not let us infer
neighbors. To avoid pitfalls in the evaluation [8], for each subject                                                                             easily some ideas, but we can see some trends. In general, CCF
we combined the 4 sets of recommendations into one set, we                                                                                       has the worst results, suggesting that including the amount of
changed the order of the articles randomly and we ask them to                                                                                    raters in the ranking formula is an important factor to consider in
evaluate each article relevancy (relevant, somewhat relevant, and                                                                                the success of these recommendations. In addition, the dissimilar
not relevant), and novelty (novel, somewhat novel, and not novel)                                                                                results of BM25 using 10 and 20 neighbors, suggests that we
using a 3-point scale. For example, one article can be evaluated as                                                                              should have taken a threshold to select the size of the
relevant but not novel (because it was already known), and                                                                                       neighborhood instead of choosing a fixed number such as 10 or
another article can be judged to be relevant and also novel,                                                                                     20. For example, CiteULike shows a neighborhood for each user,
because the user just discovered and found it to be important to                                                                                 including just those who share at least the median number of
her interests.                                                                                                                                   articles of the center user.

Another aspect considered to control the evaluation was to
provide the URL on CiteULike of each article. We requested each
                                                                                                                                                 5. RELATED WORK
subject to evaluate the articles based on that information or                                                                                    A few pioneer projects explored different ways to integrate social
looking for the abstract on the internet, but don’t going further                                                                                links or social tags. In [3], the authors incorporate social tags and
than the abstract.                                                                                                                               also the concept of web of trust for the issue of quality assessment
                                                                                                                                                 into a collaborative recommendation approach. The study in [4]
                                                                                                                                                 investigates the effect of incorporating tags to different CF
4.2 Results                                                                                                                                      algorithms, testing their algorithms on last.fm, a musical social
For each subject, we calculated normalized Discounted
                                                                                                                                                 tagging system, obtaining promising results. The approach
Cumulative Gain (nDCG) [7], Precision_2 @ 5, Precision_2 @
                                                                                                                                                 presented in [5] compared a pure content-based with a tag-
10, Precision_2_1 @ 5 and Precision_2_1 @ 10 over the different
                                                                                                                                                 enhanced recommender, showing an improvement in predicted
initial four lists of recommendations. In Precision_2_1, we
                                                                                                                                                 accuracy in the context of cultural heritage personalization.
consider relevant those articles evaluated as Relevant and
Somewhat Relevant. In Precision_2, we only consider relevant the
                    1.2                                                                                     2                                                                                        1.2
                                                                                                           1.8
                     1                                                                                                                                                                                1
                                                                                                           1.6
                                                                                                           1.4
                                                                                                                                                                              Precision_2 @ 5


                    0.8                                                                                                                                                                              0.8
                                                                                       Average Novelty


                                                                             CCF                           1.2                                                      CCF                                                                                       CCF
 nDCG


                    0.6                                                      NwCF                           1                                                       NwCF                             0.6                                                      NwCF
                                                                             BM25_10                       0.8                                                      BM25_10                                                                                   BM25_10
                    0.4                                                                                                                                                                              0.4
                                                                             BM25_20                       0.6                                                      BM25_20                                                                                   BM25_20
                                                                                                           0.4
                    0.2                                                                                                                                                                              0.2
                                                                                                           0.2
                     0                                                                                      0                                                                                         0
                          user_1 user_2 user_3 user_4 user_5 user_6 user_7                                       user_1 user_2 user_3 user_4 user_5 user_6 user_7                                          user_1 user_2 user_3 user_4 user_5 user_6 user_7

                                           (a) nDCG                                                                        (b) Average Novelty score                                                                 (c) Precision_2 @ 5

                    1.2                                                                                    1.2                                                                                       1.2

                     1                                                                                      1                                                                                         1
                                                                                                                                                                                Precision_2_1 @ 10
                                                                                       Precision_2_1 @ 5


                                                                                                                                                                                                     0.8
 Precision_2 @ 10


                    0.8                                                                                    0.8
                                                                             CCF                                                                                    CCF                                                                                       CCF
                    0.6                                                      NwCF                          0.6                                                      NwCF                             0.6                                                      NwCF
                                                                             BM25_10                                                                                BM25_10                                                                                   BM25_10
                    0.4                                                                                    0.4                                                                                       0.4
                                                                             BM25_20                                                                                BM25_20                                                                                   BM25_20
                    0.2                                                                                    0.2                                                                                       0.2

                     0                                                                                      0                                                                                         0
                          user_1 user_2 user_3 user_4 user_5 user_6 user_7                                       user_1 user_2 user_3 user_4 user_5 user_6 user_7                                          user_1 user_2 user_3 user_4 user_5 user_6 user_7


                                         (d) Precision_2 @ 10                                                                 (e) Precision_2_1 @ 5                                                                 (f) Precision2_1 @ 10


                           Figure 1: Metrics showing the results of each user on each method of the experiment (a) nDCG, (b) Average Novelty,
                                        (c) Precision_2 @ 5, (d) Precision_2 @ 10, (e) Precision_2_1 @ 5, (f) Precision_2_1 @ 10
The study presented in [6] describes the use of CiteULike for          uncertainty produced by items with too few ratings. Third, a tag-
recommending scientific articles to users. They compared three         based approach to obtain the neighborhood of a user on social
different collaborative filtering algorithms, two item-based and       tagging systems can be a suitable alternative to classical Pearson
one user-based, and they found that the user-based performed the       correlation. Our survey to seven users was a preliminary study and
best. They evaluated their algorithms using accuracy metrics as        on eventual investigations we will consider more subjects to
MAP, MMR and Precision @ 10, with low accuracy levels, in the          support our findings.
range 0.1-0.3.
                                                                       For our future research, we have already discussed two ideas.
In [8] McNee et al. developed three algorithms to recommend            Firstly, we want to incorporate tags on the ranking model. On this
articles to users, and they assessed them with a detailed survey on    study we used tags only to obtain the neighborhood, i.e., to
real users. In some algorithms, the subjects provided strong           perform the user-similarity calculations. We believe extending the
negative results, and the authors describe in their conclusion that    use of tags can improve the results of precision of our BM25
when evaluating a recommender system “the evaluation must be           approach. Secondly, we will cluster the users’ tags. Users can
done with real users, as current accuracy metrics cannot detect        have more than one interest of research, which is easy to observe
these problems”. Based on this study we decided to ask the             while examining their tags. We will implement clustering
subjects to evaluate the novelty in addition to the relevance of the   algorithms to identify the different interests of the users and we
recommended articles. Four of our seven subjects commented at          expect to provide more topic-oriented recommendations.
the end of the survey that they found very interesting articles in
their recommendation list.
                                                                       8. ACKNOWLEDGMENTS
                                                                       This material is based upon work supported by the National
6. DISCUSSION                                                          Science Foundation under Grant No. 0840597.
During the development of our approaches, we were stack for a
while on CCF and NwCF for the low quality of the preliminary
recommendations. We were using the ratings given by the users to       9. REFERENCES
obtain their neighborhood, which are given by 5-star scale and an      [1] Pazzani, M. and Billsus, D. 2007 Content-Based
“I've already read it” description. Since many users post articles         Recommendation Systems. The Adaptive Web. (May 2007),
without taking care of the ratings (by default it is 2 stars), and         325-341.
their evaluation criteria can vary a lot among different users, we     [2] Schafer, J., Frankowski, D., Herlocker, J. and Sen, S. 2007
decided to change the scale for a 3-point one. Afterwards, the             Collaborative Filtering Recommender Systems. The
results showed a significant improvement. We suggest paying                Adaptive Web. (May 2007), 291-324.
attention to the rating scale used in recommender algorithms for       [3] Massa, P. and Avesani, P. 2004 Trust-Aware Collaborative
social bookmarking systems, in order to diminish the impact of             Filtering for Recommender Systems. In Proceedings of OTM
noise and users’ criteria.                                                 Confederated International Conferences, CoopIS, DOA, and
                                                                           ODBASE (Agia Napa, Cyprus, Oct. 25-29, 2004). 492-508.
We consider that the inclusion of the amount of raters in the          [4] Tso-Sutter, K. H., Marinho, L. B., and Schmidt-Thieme, L.
ranking formula is an important contribution. The Figure 1 shows           2008. Tag-aware recommender systems by fusion of
clearly that both nDCG and Precision metrics had better results            collaborative filtering algorithms. In Proceedings of the 2008
for NwCF than for CCF. This result supports our claim that the             ACM Symposium on Applied Computing (Fortaleza, Brazil,
“social knowledge” provided by the amount of raters helps to               March 16 - 20, 2008). SAC '08. ACM, New York, NY,
decrease the uncertainty implicit on items with too few ratings.           1995-1999.
However, this approach should be considered carefully depending        [5] de Gemmis, M., Lops, P., Semeraro, G., and Basile, P. 2008.
on the user information need. CCF shows, in general, the best              Integrating tags in a semantic content-based recommender. In
novelty values among the subjects, but this idea should be tested          Proceedings of the 2008 ACM Conference on Recommender
with more users to be claimed as true.                                     Systems (Lausanne, Switzerland, October 23 - 25, 2008).
Regarding BM25-based similarity, in most cases it performs better          RecSys '08. ACM, New York, NY, 163-170.
than CCF, but with no predictable results between using 10 or 20       [6] Bogers, T. and van den Bosch, A. 2008. Recommending
neighbors, which implies that a threshold based on each user               scientific articles using citeulike. In Proceedings of the 2008
characteristics should result better than a fixed number of                ACM Conference on Recommender Systems (Lausanne,
neighbors.                                                                 Switzerland, October 23 - 25, 2008). RecSys '08. ACM, New
                                                                           York, NY, 287-290.
                                                                       [7] Manning, C., Raghavan, P. and Schutze, H. 2008
7. CONCLUSIONS AND FUTURE WORK                                             Introduction to Information Retrieval. Cambridge University
In this study, we implemented three variations of user-based               Press.
collaborative filtering algorithms on the popular social               [8] McNee, S. M., Kapoor, N., and Konstan, J. A. 2006. Don't
collaborative tagging service for scientific articles, CiteULike. We       look stupid: avoiding pitfalls when recommending research
can summarize the results of our study in three main findings.             papers. In Proceedings of the 20th Anniversary Conference
First, classical rating-based collaborative filtering algorithms           on Computer Supported Cooperative Work (Banff, Alberta,
implemented on social tagging systems must analyze carefully the           Canada, November 04 - 08, 2006). CSCW '06. ACM, New
rating scale to avoid noise on the recommendation lists. Second,           York, NY, 171-180.
incorporating the amount of raters on the ranking formula of
classical recommender algorithms can help to decrease the