Collaborative and Content-based Filtering for Item
           Recommendation on Social Bookmarking Websites

                                   Toine Bogers                                                   Antal van den Bosch
               ILK / Tilburg centre for Creative Computing                            ILK / Tilburg centre for Creative Computing
                             Tilburg University                                                     Tilburg University
                         P.O. Box 90153, 5000 LE                                                P.O. Box 90153, 5000 LE
                         Tilburg, The Netherlands                                               Tilburg, The Netherlands
                             A.M.Bogers@uvt.nl                                                  Antal.vdnBosch@uvt.nl

ABSTRACT                                                                             both the user and the items: past user preferences, purchase his-
Social bookmarking websites allow users to store, organize, and                      tory, demographic information, item popularity, the metadata char-
search bookmarks of web pages. Users of these services can an-                       acteristics of the products, etc. Social bookmarking websites, with
notate their bookmarks by using informal tags and other metadata,                    their emphasis on open collaborative information access, offer an
such as titles, descriptions, etc. In this paper, we focus on the task               ideal scenario for the application of recommender systems technol-
of item recommendation for social bookmarking websites, i.e. pre-                    ogy. They allow users to manage their favorite bookmarks online
dicting which unseen bookmarks a user might like based on his or                     through a web interface and, in many cases, allow their users to
her profile. We examine how we can incorporate the tags and other                    tag the content they added to the system with keywords. The ag-
metadata into a nearest-neighbor collaborative filtering (CF) algo-                  gregation of these tags, the folksonomy, is an extra annotation layer
rithm, by replacing the traditional usage-based similarity metrics                   connecting users and items. The underlying application then makes
by tag overlap, and by fusing tag-based similarity with usage-based                  all information sharable among users. The success of such websites
similarity. In addition, we perform experiments with content-based                   partly depends on how the connections between users, items, and
filtering by using the metadata content to recommend interesting                     tags are exploited.
items. We generate recommendations directly based on Kullback-                          Social bookmarking websites also offer possibilities for attach-
Leibler divergence of the metadata language models, and we ex-                       ing item-specific metadata to each item, such as the item’s title or
plore the use of this metadata in calculating user and item simi-                    summary. This additional metadata could also be used to support
larities. We perform our experiments on three data sets from two                     the recommendation process. Using item metadata to boost perfor-
different domains: Delicious, CiteULike and BibSonomy.                               mance is not new in recommender systems research, but typically
                                                                                     such content-related information is attached to the items, and is
                                                                                     therefore the same across all users [23]. On the other hand, the tags
Categories and Subject Descriptors                                                   assigned to items are specific to the user who added them. We know
H.3 [Information Storage and Retrieval]: H.3.1 Content Anal-                         of no approaches to item recommendation for social bookmarking
ysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4                     that investigate the use of such metadata.
Systems and Software                                                                    In this paper we focus on the question how we can make use
                                                                                     the information represented by the folksonomy and the item meta-
General Terms                                                                        data to boost the performance of traditional collaborative filtering
                                                                                     algorithms. We make the following contributions. First, we exam-
Algorithms, Measurement, Performance, Experimentation
                                                                                     ine different ways of extending the standard nearest-neighbor algo-
                                                                                     rithm with information about tagging behavior. Second, we explore
Keywords                                                                             how best to use the metadata for recommendation by proposing
Recommender systems, social bookmarking, folksonomies, collab-                       four different algorithms. Finally, we perform our experiments on
orative filtering, content-based filtering                                           publicly available data sets with standardized evaluation metrics to
                                                                                     promote verifiability. The remainder of this paper is structured as
1.     INTRODUCTION                                                                  follows. We start by introducing our data sets and experimental
                                                                                     setup in the next section. In Section 3 we describe and compare
   Recommender systems belong to a class of personalized infor-
                                                                                     different CF algorithms that operate on the folksonomy to gener-
mation filtering technologies that aim to identify which items in
                                                                                     ate recommendations. Section 4 we describe how we exploit the
a catalog might be of interest to a particular user. Recommenda-
                                                                                     metadata in our data sets to generate item recommendations. We
tions can be made using a variety of information sources related to
                                                                                     describe the related work in Section 5 and conclude in Section 6.

Permission to make digital or hard copies of all or part of this work for            2.    METHODOLOGY
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to      2.1    Data Sets
republish, to post on servers or to redistribute to lists, requires prior specific      We base our experiments on four data sets that were collected
permission and/or a fee.                                                             from three different social bookmarking websites with different char-
ACM RecSys ’09 Workshop on Recommender Systems and the Social Web
October 25, 2009, New York, NY, USA.                                                 acteristics: CiteULike, BibSonomy, and Delicious. Two data sets
Copyright 2009 by the author(s).                                                     correspond to the domain of Web page bookmarks (Delicious and
BibSonomy) and the other two cover the domain of scientific arti-        2.2     Experimental Setup & Evaluation
cles (Delicious and BibSonomy).                                             In order to evaluate and compare different recommender algo-
   CiteULike is a social bookmarking service that allows its users to    rithms, we need a proper framework for experimentation and evalu-
add their academic reference library to an online profile1 . Articles    ation. Recommender systems evaluation—and the differences with
can be stored with their metadata, abstracts, and links to the papers    IR evaluation—has been addressed by, among others, [11]. We
at the publishers’ sites. Users can also add personal comments and       evaluate the “Find Good Items” task, also known as Top-N rec-
tags. CiteULike offers daily dumps of their core database. We used       ommendation, where users are provided with a ranked list of rec-
the dump of November 2, 2007 as the basis for our experiments.           ommended items based on their personal profile. We divide each
A dump contains all information on which articles were posted by         data set into a training and test set by randomly selecting 10% of
whom, with which tags, and at what point in time. It does not, how-      the users to be in our test set. Final performance is evaluated on
ever, contain any other item metadata, so we crawled this ourselves      this 10% by withholding 10 items from each of these so-called ac-
from the CiteULike website using the article IDs. Articles are an-       tive users, and using the remaining profile items together with the
notated using the standard BibTeX-like fields, such as title, author     training set to generate the recommendations for those 10%. If the
names, page numbers, publisher information, etc.                         withheld items are predicted at the top of the ranked result list, then
   BibSonomy is a social bookmarking service for sharing Web             the algorithm is considered to do perform well. To prevent over-
page bookmarks and reference lists of scientific articles2 . Items       estimation when optimizing algorithm parameters, we use 10-fold
are stored and represented by their BibTeX metadata representa-          cross-validation. We subdivide our training set into 10 folds and
tions. These can include abstracts and links to the papers at the pub-   use these for 10-fold cross-validation of our parameter optimiza-
lishers’ websites. Users are able to tag their bookmarked content        tion. For each fold, 10 items are withheld from the test fold users
and use these tags to browse and discover related references [13].       to be retrieved by the recommendation algorithm. The final values
BibSonomy’s creators organized the 2008 ECML/PKDD Discov-                for our evaluation metric on the withheld items are then macro-
ery Challenge which focused on social bookmarking, and released          averaged over the 10 folds.
the BibSonomy data set to the public in May 2008 as part of this            In our evaluation, we adopt an IR perspective by treating each of
challenge3 . The organizers made a snapshot of the BibSonomy             the users as a separate query or topic. The 10 withheld items for
system consisting of all resources posted to BibSonomy between           each user constitute the items for which we have relevance judg-
its inception in 2006 and March 31, 2008. It includes the same           ments. Herlocker et al. [11] assess the usefulness of different met-
type of article metadata as we collected for CiteULike. The dis-         rics for different types of recommendation tasks. For the Top-N
tinction between bookmarks and BibTeX records is also made in            recommendation task, they find that metrics that take into account
this snapshot. We therefore split this data dump into a data set con-    the ranking of the items are most appropriate. We therefore evalu-
taining only web bookmarks (Bibsonomy Bookmarks), and a data             ate our algorithms using Mean Average Precision (MAP), which is
set containing only scientific articles (Bibsonomy Articles).            defined as the average of the Average Precision values calculated
   Delicious is a social bookmarking service for storing, sharing,       over each relevant retrieved item. For determining significance of
and discovering web bookmarks. It allows its users to manage             differences between runs, we use a two-tailed paired Student’s t-
and tag URLs of web pages4 . Unlike CiteULike and BibSonomy,             test. We report on significant differences against the best baseline
Delicious does not offer data dumps of their databases, so we gath-      runs using M (and O ) for α = .05 and N (and H ) for α = .01.
ered our data set by crawling a subset of the Delicious website. Be-
cause of our focus on the task of item recommendation for users,
our aim was to collect a balanced, unbiased set of user profiles,        3.    FOLKSONOMIC RECOMMENDATION
i.e. the complete set of bookmarks a user had posted to Delicious.
                                                                            We start by establishing some notation and definitions of the task
From an earlier breadth-first crawl of Delicious we obtained a list
                                                                         at hand, based in part on notation by [9]. In the social bookmarking
of 300,000 users. We randomly selected around 18,000 of these
                                                                         setting, users post items to their personal profiles and can choose
users to match the size of our CiteULike data set, and crawled their
                                                                         to label them with one or more tags. We define a folksonomy to be
entire profiles.
                                                                         the tripartite graph that emerges from this collaborative annotation
                                                                         of items. The resulting ternary relations that make up the tripar-
2.1.1     Data set filtering                                             tite graph can be represented as a 3D matrix of users, items, and
   It is common practice in recommender system evaluation to se-         tags. Figure 1 illustrates this view. We refer to the 3D matrix as
lect realistic subsets of the data sets used to ensure that reliable     D(uk , il , tm ). Here, each element d(k, l, m) of this matrix indicates
recommendations can be generated. This also allows for a fair            if user uk (with k = {1, . . . , K}) tagged item il (with l = {1, . . . , L})
comparisons of different recommendation algorithms [11]. This            with tag tm (with m = {1, . . . , M}), where a value of 1 indicates the
is typically done by filtering out users or items whose profile size     ternary relation is present in the folksonomy.
or popularity falls below a certain threshold. We follow this proce-        In conventional recommender systems, the user-item matrix con-
dure in our preparation of the data sets as well. We only retain the     tains ratings information. These ratings can be explicit, when they
users who have added 20 items or more to their personal profile. In      are entered directly by the user, or implicit, when they are inferred
addition, we filter out all items that occur only once, as well as all   from user behavior. In our case we have implicit, unary ratings
untagged posts. We were able to identify and filter out most of the      where all items that were added by a user receive a rating of 1.
spam content in the CiteULike and BibSonomy data sets. We refer          We extract this ratings matrix R(uk , il ) for all user-item pairs di-
the reader to [5] for more details about this process. Table 1 lists     rectly from the tripartite graph. We denote its individual elements
the statistics of our four data sets after filtering.                    by xk,l = {1, ∅}. Each user is represented in this matrix as its user
                                                                         profile row vector uk , which lists the items that user added to his or
1
  http://www.citeulike.org/                                              her profile. Items are represented by the column vectors of R which
2                                                                        represent the item profile vectors il that contain all users that have
  http://www.bibsonomy.org/
3                                                                        added that item. As shown in Figure 1, we can also extract a user-
  http://www.kde.cs.uni-kassel.de/ws/rsdc08/
4
  http://www.delicious.com/                                              item matrix from D by aggregating over the tag dimension. We then
                                           Table 1: Statistics of the filtered versions of our four data sets.
                                                                    bookmarks                       articles
                                                             Delicious BibSonomy CiteULike BibSonomy
                                    # users                      1,243             192          1,322             167
                                    # items                    152,698          11,165        38,419          12,982
                                    # tags                      42,820          13,233        28,312           5,165
                                    # posts                    238,070          29,096        84,637          29,720
                                    user-item sparsity         99.8746         98.6427       99.8334         98.6291
                                    avg # items per user         191.5           151.5           64.0          178.0
                                    avg # users per item            1.6             2.6           2.2             2.3
                                    avg # tags per user          192.1           203.3           57.3            79.2
                                    avg # users per tag             5.6             2.9           2.7             2.6
                                    avg # tags per item             4.8             8.4           5.3             3.1
                                    avg # items per tag            17.0             7.1           7.3             7.7


                                                                items
                                                                                    3.1     Baseline Recommendation Algorithms
                                                                                       A common and well-understood source of information for rec-
                                                                                    ommendation is usage patterns of adding and rating items. The
                                              users             D                   class of algorithms that exploit such patterns for recommendation
                                                                                    purposes are called Collaborative Filtering algorithms (CF). In this
                                                                             tags   paper we focus on using and extending the k-Nearest Neighbor (k-
                                                                                    NN) algorithm. We pick the k-NN algorithm because it is a well
                                                            Σ                       understood algorithm that can intuitively be extended to include
                                                                                    other information in addition to transaction patterns [7, 10]. There
            i1   items   iL                            i1   items       iL          are two flavors of the k-NN algorithm for CF: user-based filtering
       u1                                         u1                                and item-based filtering. In user-based filtering, we locate the users
                                                                                    most similar to the active users, and look among their items for new
                                  binarize                                          recommendations. In item-based filtering, we locate the most simi-
    users         R                           users             UI
                                                                                    lar items for each of the active user’s items and aggregate these into
       uK                                         uK                                a list of predicted items.

                                                                                    User-based Filtering.
Figure 1: Representing the folksonomy graph as a 3D matrix.                             In the first step of user-based filtering we calculate the similari-
The ratings matrix R is derived from the tripartite graph it-                       ties between pairs of users to identify the most similar users for an
self, and directly represents what items were added by which                        active user. Many different similarity metrics have been proposed
users. Aggregation over the tag dimension of D gives us matrix                      and evaluated over time, such as Pearson’s correlation coefficient
UI, containing the tag counts for each user-item pair. We can                       and cosine similarity [6]. We use the cosine similarity in our ex-
obtain R by binarizing the values in UI..                                           periments as it has often been used successfully on data sets with
                                                                                    implicit ratings [6, 19]. We calculate the cosine similarity between
                                                                                    the active user uk and another user ua on the user profile vectors uk
obtain the K × L user-item matrix UI(uk , il ) = m=1            D(uk , il , tm ),   and ua as simcosine (uk , ua ) = ||uukk||·u||uaa || .
                                                          PM
specifying how many tags each user assigned to each item. Be-                           The next step in user-based filtering is to determine the top N
cause we filtered our data sets to include only tagged content, our                 similar users (or items) for user uk . We denote this set as the
ratings matrix R is the same as a binary version of UI. Similar to                  Set of Similar Users SSU(uk ), which are the top N users of the
the way we defined UI we can also aggregate the content of D over                   set of all users ua , ranked by their cosine similarity. For each
                                                                                    user ua , we only consider those items that ua added to his pro-
                    PLdimensions. We define the K × M user-tag ma-
the user and the item
trix UT(uk , tm ) = l=1  D(uk , il , tm ), specifying how often each user           file (xa,l , ∅). Using this set of nearest neighbors we generate
used certain tag to annotate his or her items. Individual elements                  the
                                                                                    P final prediction scores b                                          xk,l =
                                                                                                                         xk,l for each unseen item il as b
of UT are denoted    by yk,m . We define the L × M item-tag matrix                     ua ∈SSU(uk ) simcosine (uk , ua ). Here, the predicted score of an item
IT(il , tm ) = k=1 D(uk , il , tm ), indicating how many users assigned
              PK
                                                                                    il is the sum of the similarity values (between 0 and 1) of all N
a certain tag to an item. Individual elements of IT are denoted by                  nearest neighbors that actually added item il (i.e. xa,l , ∅).
zl,m . We also define binary versions of UT and IT as UTbinary and                      A recurring observation from the literature about CF algorithms
ITbinary . The row vectors of the UT and IT matrices represent the                  is that universally liked items are not as useful for capturing the
user tag profiles dk and item tag profiles fl respectively. They list               similarity between users as less common items, see e.g. [6]. We
what tags have been assigned by a user to his items, or to an item                  therefore perform two runs: the ‘vanilla’ base run described above
by its users. Formally, the goal of each of the recommendation al-                  (u-bin-sim) and a run where the values in the user vectors are weighted
gorithms discussed in this paper is to produce a ranking of all items               by the inverse user frequencies of the items (u-bin-idf-sim).
l that are not yet in the profile of the active user uk (i.e., xk,l = ∅).
To this end, we predict a score b     xk,l for each item that represents the        Item-based Filtering.
likelihood of that item being relevant for the active user. The final                 The item-based k-NN algorithm operates analogously to the user-
recommendations for a user are generated by ranking all items il by                 based filtering algorithm [19]. Instead of comparing users directly,
their predicted score bxk,l .
Table 2: Results of the folksonomic recommendation runs. Reported are the MAP scores as well as the optimal number of neighbors
N. Best-performing runs for each data group of approaches are printed in bold. Best-performing tag overlap runs for both user-
based and item-based are printed in bold. The percentage difference between the best baseline CF runs and the best tag overlap runs
are indicated after each type.

                                                                                 bookmarks                                articles
                        Runs                                           BibSonomy         Delicious            BibSonomy          CiteULike
                                                                       MAP         N    MAP       N            MAP        N      MAP        N
                        Best baseline UB CF run                       0.0277H      13 0.0046H 15             0.0865H      4    0.0757H      15
                                                                      (u-bin-idf-sim)   (u-bin-sim)            (u-bin-sim)     (u-bin-idf-sim)
                        ut-jaccard-sim                                0.0070H      8   0.0015H 11            0.0459H      6    0.0449H       5
                        ut-dice-sim                                   0.0069O      6   0.0007H     6         0.0333O      4    0.0439H       2
                        ut-bin-sim                                    0.0102H      5   0.0017H 11            0.0332  O
                                                                                                                          4    0.0452  H
                                                                                                                                             3
                                                                              O               O                      H                 H
                        ut-tf-sim                                     0.0069       2   0.0015     25         0.0368       4    0.0428        8
                        ut-tfidf-sim                                  0.0018O      6   0.0013O 17            0.0169H      2    0.0400H       2
                        % Change over best UB CF run                      -63.2%          -63.0%                 -46.9%            -40.7%
                        Best baseline IB CF run                       0.0244H      34 0.0027H 25             0.0737H 49 0.0887H             30
                                                                      (i-bin-idf-sim)   (i-bin-sim)          (i-bin-idf-sim) (i-bin-idf-sim)
                        it-jaccard-sim                                0.0370H      3   0.0083M 21            0.0909H      6    0.0810H      14
                        it-dice-sim                                   0.0317H      2   0.0089M 25            0.0963H      8    0.0814H       8
                        it-bin-sim                                    0.0334H      2   0.0101M 23            0.0868H      5    0.0779H      10
                        it-tf-sim                                     0.0324H      4   0.0100M 11            0.0823  H
                                                                                                                          4    0.0607  H
                                                                                                                                            17
                        it-tfidf-sim                                  0.0287H      8   0.0058H     7         0.1100H      7    0.0789H      21
                        % Change over best IB CF run                      +51.6%         +274.1%                 +49.3%             -8.2%
                        % Change over best CF run                         +33.6%         +119.6%                 +27.2%             -8.2%


we try to identify the best recommendations for each of the items                         rics, which means we calculate them on the binary vectors from the
in an active user’s profile. In other words, for item-based filter-                       UTbinary matrix. The Jaccard Overlap sim Jaccard (dk , da ) between two
ing we calculate the similarities between the test items of the ac-                       users dk and da is defined as |d k ∩da |
                                                                                                                        |dk ∪da |
                                                                                                                                   . Dice’s coefficient simDice (dk , da )
tive user uk and the other items that user has not yet added (so                                              k ∩da |
                                                                                          is defined as 2|d           . We refer to the user-based runs with the Jac-
xk,l = ∅). Similarity between two items il and ib is calculated on the                                    |dk |+|da |
                                                                                          card overlap and Dice’s coefficient as ut-jaccard-sim and ut-dice-
item profile vectors il and ib as simcosine (il , ib ) = ||ili||l ·i||ibb || . Next, we   sim respectively. The cosine similarity is calculated in three dif-
identify the top N most similar items for each of the active user’s                       ferent ways. First, we calculate it on the regular tag count vectors
items il separately. We define this neighborhood as the Set of Sim-                       dk and da from UT as ut-tf-sim, and on the binary vectors from the
ilar Items SSI(il ), where we select the top N of all items not al-                       UTbinary matrix as ut-bin-sim. In addition, we also experiment with
ready in the active user’s profile, ranked on their cosine similarity                     idf-weighting of the tags in the user tag count vectors as we did be-
simcosine (il , ib ) to item il . Using this set of nearest neighbors we                  fore. We refer to this run as ut-tfidf-sim. The item-based versions
generate
       P the final prediction score b         xk,l for each unseen item il as             of these similarity metrics are calculated on the IT and ITbinary ma-
xk,l = ib ∈SSI(il ) simcosine (il , ib ). Here, the predicted score is the sum
b                                                                                         trices. We refer to these five item-based runs as it-jaccard-sim,
of the similarity values (between 0 and 1) of all the most similar                        it-dice-sim, it-bin-sim, it-tf-sim, and it-tfidf-sim.
items that were added by user uk (i.e. xk,b , ∅). Analogous to
user-based filtering, we can also suppress the influence of the most
‘popular’ users, i.e. users that have added a disproportionately large                    3.3     Results & Discussion
number of items to their profile, such as bots or spam users. We re-                         Table 2 compares the results of our baseline CF runs that em-
fer to the item-based filtering runs weighted with the inverse item                       ploy usage-based similarities to the runs that use overlap in tag-
frequency as i-bin-idf-sim and to the unweighted runs as i-bin-sim.                       ging behavior as a source of user and item similarity. We see that
                                                                                          the user-based filtering baseline outperforms item-based filtering
                                                                                          on three of four data sets; only on CiteULike does item-based filter-
3.2      Tag Overlap Similarity                                                           ing work better, where this difference is also statistically significant
   The folksonomies present in our four data sets each constitute an                      (p < 0.05). The other differences between user-based and item-
extra layer of connections between user and items. We exploit this                        based filtering are not significant. There appears to be no clear or
extra layer for determining another type of similarity between users                      statistically significant advantage to applying idf-weighting to the
or items. For instance, users that assign many of the same tags can                       profile vectors. An explanation for the advantage of user-based fil-
be seen as similar, and items that are often assigned the same tags                       tering is that, according to Table 1, the average number of items per
can also be seen as similar.                                                              user is much higher than the average number of users per item. Cal-
   We restrict ourselves to comparing three common similarity met-                        culating a meaningful overlap between user profile vectors could
rics: Jaccard overlap, Dice’s coefficient, and the cosine similarity.                     therefore be more robust than between item profile vectors.
We use these similarities as the basis for item recommendation.                              As for the results with tag overlap, we observe that item simi-
The only difference between this approach and the standard CF al-                         larities based on tag overlap work well for item-based filtering, as
gorithm is in the first step, where the similarities are calculated.                      three of our four data sets show considerable improvements over
For user-based filtering, we calculate tag overlap on the UT ma-                          the best CF baseline runs. Performance increases range from 27%
trix or on the binarized version UTbinary , depending on the metric.                      on Bibsonomy Articles to almost 120% on Delicious, but these
Both the Jaccard overlap and Dice’s coefficient are set-based met-                        are only statistically significant on the Delicious data set. We see
the opposite trend for user-based filtering, where tag overlap re-        sic and extrinsic fields, resulting in a total of 34 runs per algorithm.
sults in significantly worse scores for almost all variants on all data   We did not test the extrinsic fields separately. Due to space restric-
sets, with performance decreases ranging from 40% to 63%. This            tions we only report the results of the best runs for each algorithm.
means that using tag overlap in item-based filtering makes item-
based filtering outperform user-based filtering on all four data sets.    4.1       Content-based Filtering
We believe that it is the reduction in sparsity from using tag overlap        The first approach we propose is content-based filtering where
that causes this difference in performance. On average, the number        the focus is on properly representing the content in our social book-
of tags assigned to an item is 2.5 times higher than the number of        marking data sets. Based on these representations our aim is to
users who have added the item. This means that, on average, item          construct an interest profile of an active user, and then use this
profile vectors from the IT matrix are less sparse than item pro-         profile to rank-order the unseen items by similarity to the profile,
file vectors from the UI matrix, making the possibility of overlap        thereby approximating possible interest in those items. Figure 2
between vectors more likely. Using more values in the similarity          illustrates two different algorithms we propose for content-based
calculation leads to a better estimate of the real similarity between     filtering: profile-centric matching and post-centric matching.
two items.                                                                    The difference between our two content-based filtering algorithms
   For user-based filtering this difference is not as well-pronounced:    is the level of aggregation. In our profile-centric matching ap-
in some data sets users have more items than tags on average, and         proach, we collate all of a user’s assigned metadata into a single
more tags than items in other data sets. This explains why we do          user profile. The intuition here is that by aggregating all of the
not see the same performance increase for the user-based filtering        metadata assigned by a user we can completely capture his or her
runs based on tag overlap. The results of the different tag overlap       interests. Similarly, we construct item profiles that collate all of the
metrics tend to be close together and differences between them are        metadata assigned to those items by all users in the training set. We
not statistically significant. Even though the best-performing simi-      then match the active user profiles against the item profiles on sim-
larity metrics are dependent on the data set, we do see that the met-     ilarity to produce a ranking of all items, as illustrated in the top half
rics operating on the binary vectors from the UTbinary and ITbinary       of Figure 2. After removing the items already in the active user’s
matrices are consistently among the top performers.                       profile, we are left with the final rank-ordered list of recommenda-
   In general, it appears that bookmark recommendation is more            tions.
difficult than article recommendation. We believe this is due to a            In contrast, post-centric matching operates on the level of indi-
difference in topic specificity. The Delicious and Bibsonomy Book-        vidual posts. We match each of an active user’s posts separately
marks data sets cover bookmarks of web pages, which encompass             against all the other posts of unseen items in the training set, as
many more topics than scientific articles do. Users of Delicious          illustrated in the bottom half of Figure 2. This leads to a list of
and Bibsonomy Bookmarks can be expected to have more differ-              matching posts in order of similarity for each of the active user’s
ent topics in their profile, making it more difficult to recommend        posts. Since retrieval scores are not directly comparable between
new, interesting bookmarks based on their profiles. We see evi-           runs, we normalize the original similarity scores simorg into [0,
dence for this explanation in the average number of unique tags per       1] using the maximum and minimum similarity scores simmax and
                                                                                                                 simorg −simmin
user: 203.3 and 192.1 for Bibsonomy Bookmarks and Delicious               simmin according to simnorm = simmax          −simmin
                                                                                                                                . We then calculate a
respectively, which is markedly higher than the 79.2 and 57.3 for         rank-corrected   sum  of   similarity scores    for each item il according to
                                                                          score(i) = log(rank(i
                                                                                       P simnorm (il )
Bibsonomy Articles and CiteULike.                                                                       . The final list of recommendations ranks
                                                                                                 l ))+1
                                                                          every unseen item il by their rank-corrected score score(il ).
4.    RECOMMENDATION USING METADATA
   In addition to the folksonomic structure of the underlying net-              1                                   (a) proﬁle‐centric matching
                                                                            A
work, social bookmarking services also offer users the possibility                                Ac:ve user proﬁles                             Training item proﬁles
                                                                            A   2
to annotate the content of their items with metadata. In this section
we investigate the role such metadata can play in recommending                  3                                                                   3         A       C
                                                                            A
interesting bookmarks or references. We propose two different ap-                                                              similarity
                                                                            B   2                 D        1    2                                   4         C
proaches: content-based filtering and hybrid filtering. Before we                                                              matching

move on to describing these in Sections 4.1 and 4.2, we first take a        B   5                                                                   5         B       C
closer look at the metadata we have available.                                         training
                                                                            C   1
                                                                                         pairs
   In our approach we distinguish between item-intrinsic and item-                                                   (b) post‐centric matching
                                                                            C   3
extrinsic metadata. Item-intrinsic metadata fields relate directly to
                                                                                                  Ac:ve user's posts                                Training posts
the content of the item being annotated. For the two data sets deal-        C   4
ing with web bookmarks these include DESCRIPTION, TAGS, TITLE,                                        1         D                                       2         A
                                                                            C   5
and URL. The two scientific article data sets contain the additional
                                                                                1                     1         D                                       3         A
intrinsic fields ABSTRACT ,AUTHOR, BOOKTITLE, EDITOR, JOURNAL, NOTE,        D                                                  similarity
and SERIES. The intuition behind assigning metadata fields to the               2                     1         D
                                                                                                                               matching                 2         B
                                                                            D
item-intrinsic category is that these fields can be used as stand-
                                                                                3
                                                                                                          ...


                                                                                                                                                            ...


alone sources for recommending other content. For instance, given           D           test
a certain paper from a user’s profile, papers with similar abstracts,           4
                                                                                        pairs         2                                                 2
                                                                            D                                   D                                                 A
papers written by the same author, or papers published at the same
workshop are likely to be relevant recommendations. In contrast,
item-extrinsic metadata fields—such as MONTH or PAGES—cannot be           Figure 2: Visualization of our two content-based filtering ap-
used to directly generate appropriate recommendations. We per-            proaches to item recommendation for a small toy data set.
formed experimental runs using the metadata of each of our intrin-
sic fields separately. In addition, we experimented with the combi-
nation of all intrinsic fields, and with runs that combined all intrin-
   In both content-based filtering algorithms, we approach the rec-                               earlier in Section 3.1. As in CF, with our hybrid filtering algorithms
ommendation process from an IR perspective and restrict ourselves                                 we also distinguish between user-based filtering, where we gener-
to measuring textual similarity. We use the open-source retrieval                                 ate recommendations by determining the most similar users, and
toolkit Lemur to calculate the similarities between the different                                 item-based filtering, where we recommend the items most similar
user and item profiles. The Lemur toolkit5 implements different                                   to the items in the active user’s profile. Like in Section 4.1, we ap-
retrieval methods based on language modeling [20]. Preliminary                                    proach the first step from an IR perspective and calculate the textual
experiments comparing language modeling with the OKAPI model                                      similarities between users or items. For each user and each item we
and a tf·idf approach suggested a language modeling approach with                                 generate user and item profile representations, constructed as fol-
Jelinek-Mercer smoothing as the best-performing retrieval method.                                 lows. All of the metadata text of a user’s posts is collated into a
The language models we used are maximum likelihood estimates                                      single “user profile” for that user. Similarly, for the item-based ap-
of the unigram occurrence probabilities. We filter stopwords using                                proach we create item profiles for each item by concatenating all of
the SMART stopword list and do not perform stemming.                                              the metadata assigned to that item by all the users who have the item
                                                                                                  in their profile. This means that items are represented by their ag-
4.2          Hybrid Filtering                                                                     gregated community metadata and not just by a single user’s data.
   In addition to focusing solely on using the metadata for recom-                                Again, we used the open-source retrieval toolkit Lemur to calcu-
mendation, we also consider a hybrid approach that joins content-                                 late the similarities between the different metadata representations,
based filtering and CF, in the hope of combining the best of both                                 with the same experimental settings as described in Section 4.1.
worlds. Many different combination methods have been proposed
in earlier work [7]. In our hybrid filtering approach we view meta-
data in social bookmarking systems as another source of informa-                                  4.3    Results & Discussion
tion for locating the nearest neighbors of users and items in CF                                     Table 3 contains the best runs for each of the four metadata-based
algorithms. Figure 3 illustrates this approach. Instead of only look-                             algorithms, as well as our best CF run from Section 3. What we see,
ing at the overlap in items that two users have in common when                                    is that on three out of four data sets a recommendation algorithm
calculating user similarities, we can use the overlap in the metadata                             that uses metadata is better than the best CF run using data from the
applied to items to determine the most similar neighbors. Users that                              folksonomy. All of our best metadata runs use the combined meta-
describe their profile items using the same terminology are likely                                data fields. On their own, each field can be seen as an imperfect
share the same interests, making them a good source of recom-                                     representation of the items and users, but combined they alleviate
mendations. This is similar to the way we used the tag clouds of                                  each others weak points and better represent the content than they
users and items to calculate similarity between users and items in                                do separately. Only on the Delicious data set do all metadata-based
the previous section. The user and item similarities we derive in                                 approaches perform significantly worse than the CF runs. Unfortu-
this way are then plugged into the standard memory-based CF al-                                   nately, we do not have an explanation for this. When we compare
gorithms as described in Section 3.1. The resulting algorithm is a                                the metadata-based approaches with each other, we see that most
feature-augmented hybrid of CF and content-based filtering.                                       differences are not statistically significant. On the Bibsonomy Ar-
                                                                                                  ticles data set, the item-centric hybrid filtering approach is signif-
         1
                                                                                                  icantly better than the user-centric approach (p < 0.05). On the
     A                                      (a) user‐based ﬁltering
                                                                                                  CiteULike data set, the profile-centric approach also significantly
     A   2                Ac:ve user proﬁles                           Training user proﬁles      outperforms the post-centric and user-centric approaches.
     A   3
                                                                              1   2       3
                                                                                                     In general, we observe that the profile-centric approach tends
                                                                   A
                                                                                                  to outperform the post-centric approach on three of our four data
     B   2                                        similarity
                          D        1    2
                                                  matching         B          2   5               sets. This improvement is statistically significant for the CiteULike
     B   5                                                                                        data set with an improvement of 117% (p < 10−6 ). Only on the
               training                                            C          1   3       4   5   Delicious data set does post-centric matching perform significantly
     C   1
                 pairs                                                                            better (p < 0.05). This advantage of the profile-centric approach
     C   3
                                                                                                  is strongest on the article data sets where the profile-centric ap-
     C   4                                  (b) item‐based ﬁltering                               proach performs best for 75% of the all runs with different fields.
                          Ac:ve user's posts                              Training posts          In the case of hybrid filtering, the item-centric approach outper-
     C   5
                                                                                                  forms the user-centric approach on three of our four data sets. On
     D   1                                                                3                       the CiteULike and Bibsonomy Articles data sets these differences
                                                                                   A      C
         2
                           1        A   C   D         similarity                                  are statistically significant and especially large at 268% (p < 0.05)
     D                                                                    4
                                                      matching                        C           and 112% respectively (p < 0.01).
         3                 2        A   B   D
     D          test                                                                                 While we do not have room to report the results of all individ-
                                                                          5           B   C
         4
                pairs                                                                             ual intrinsic field runs, we can report on our general findings. For
     D
                                                                                                  all four approaches, the best-performing single fields are AUTHOR,
                                                                                                  DESCRIPTION, TAGS, and TITLE, which provide the best individual
Figure 3: Visualization of our two hybrid filtering approaches                                    results on all four data sets for all approaches. This is not surpris-
to item recommendation for a small toy data set.                                                  ing, as these fields are the least sparsely filled of all the intrinsic
                                                                                                  fields. In addition, these four fields are also aimed directly at de-
   Hybrid filtering also consists of two steps: (1) calculating the                               scribing the content of the items, more so than the conference or
most similar neighbors of the active user or his items, and (2) us-                               journal titles or the editors. Another interesting observation is that
ing those neighbors to predict item ratings for the active user. The                              the TITLE field served as a better source of user and item similar-
latter prediction step is performed in the same manner as described                               ity on the article data sets than on the bookmark data sets. This is
                                                                                                  because titles assigned to bookmarks are more variable than titles
5
    Available at http://www.lemurproject.org                                                      assigned to scientific articles, leading to this performance gap.
Table 3: Results comparison of the best metadata-based runs with our best folksonomic CF runs. Reported are the MAP scores
as well as the optimal number of neighbors N where applicable. The best-performing runs are printed in bold. The percentage
difference between our best meta-data approaches and the best CF runs is listed in the bottom row.

                                                                   bookmarks                             articles
                      Runs                               BibSonomy         Delicious        BibSonomy           CiteULike
                                                        MAP          N    MAP         N      MAP         N      MAP         N
                                                      0.0370H         3  0.0101H 23        0.1100H       7    0.0887H 30
                      Best CF run                     (it-jaccard-sim)    (it-bin-sim)      (it-tfidf-sim)    (i-bin-idf-sim)
                      Profile-centric filtering       0.0402H         -  0.0014H       -   0.1279H       -    0.0987H       -
                                                        (all intrinsic)     (TITLE)        (all intrinsic)     (all extrinsic)
                                                               H                 O                  H                 H
                      Post-centric filtering          0.0259          -  0.0036        -   0.1190        -    0.0455        -
                                                        (all intrinsic)      (TAGS)        (all intrinsic)     (all extrinsic)
                      User-centric hybrid filtering   0.0218H         2  0.0039O 13        0.0410H       2    0.0608H       2
                                                             (URL)       (all intrinsic)       (TITLE)            (TITLE)
                      Item-centric hybrid filtering   0.0399H        11  0.0017H      8    0.1510H 21 0.0746H 21
                                                            (TAGS)       (all intrinsic)   (all intrinsic)         (TAGS)
                      % Change over best CF run             +8.6%            -61.3%            +37.2%             +11.3%


5.      RELATED WORK                                                       rative Filtering for recommending interesting items, there has also
                                                                           been considerable work on content-based filtering, which can be
5.1      Folksonomic Recommendation                                        seen as an extension of the work done on information filtering.
   One of the first approaches to recommendation for social book-          Content-based filtering has been applied to many different domains.
marking websites was presented by [12], who proposed a graph-              Early work on content-based filtering included the NewsWeeder
based algorithm called FolkRank. They generated 2D projections             system by Lang et al. [14], which used the words contained in
of the tripartite graph and proposed a random walk model similar           newsgroup messages as its features. Alspector et al. [1] compared
to PageRank [17] that uses the steady state node probabilities as          a CF approach to movie recommendation with content-based filter-
the basis for ranking their recommendations. Clements et al. [9]           ing. For their content-based component they built metadata repre-
also proposed a random walk model for item recommendation, but             sentations of all movies using fields such as directory, genre, and
combine ratings information with tagging information into a single         awards, and used linear regression and classification and regression
model. They also incorporated self-transition probabilities in the         trees to learn user profiles and rank-order the items for those users.
matrix, and used the walk length as an algorithm parameter.                They found that CF performed significantly better than the content-
   There have also been several adaptations of memory-based algo-          based methods, but noted that this was likely due to the poor feature
rithms that include information about the tags assigned by users to        set they used. Mooney et al. [15] describe Libra, a content-based
items. Approaches that resemble our use of tag overlap for calculat-       book recommender system. They crawled the book metadata from
ing similarities between users and items include [2], [16], and [22].      the Amazon website and represented each book as a bag-of-words
Tso-Sutter et al. [23] proposed a novel tag-aware k-NN algorithm           vector. They then used a Naive Bayes classifier to learn user pro-
for item recommendation. When calculating the user and item sim-           files and to rank-order unseen books for the user.
ilarities they include the tags as additional items and users respec-         We are not the first to suggest the combination of CF with content-
tively. They then calculate cosine similarity on these extended pro-       based filtering, as the advantages of both approaches are largely
file vectors and fuse together the predictions of the user-based and       complementary. CF is the more mature of the two approaches and
item-based filtering runs. This fused model is able to effectively         works best in a situation with a stable set of items and a dense user
capture the relationship between users, items, and tags.                   base. Content-based filtering methods are better at dealing with
   Symeonidis et al. [21] were among the first to propose a model-         sparse, dynamic domains such as news filtering, and are better at
based approach to incorporating tagging information in recommen-           recommending for non-average users. Basu et al. [3] were among
dation. They propose an item recommendation approach that per-             the first to propose a hybrid recommender system that used both
forms tensor decomposition on the third-order folksonomy tensor.           collaborative and content features to represent the users and items.
By performing higher-order SVD, they approximate weights for               The collaborative features captured what movies a user likes and
each user-item-tag triple in the data set, which can then be used          the content features included metadata fields such as actors, direc-
to support item recommendation. They compared their algorithm              tors, genre, titles, and tag lines. They used Ripper, a rule-based
to the FolkRank algorithm [12], and found that tensor decomposi-           machine learning algorithm to predict which items are interesting,
tion outperforms the latter. Wetzker et al. [24] took a Probabilistic      and found that the combination of collaborative and content-based
Latent Semantic Analysis (PLSA) approach, which assumes a la-              features produced the best results. Claypool et al. [8] presented a
tent lower dimensional topic model. They extended PLSA by es-              weighted hybrid recommender system that calculated a weighted
timating the topic model from both user-item occurrences as well           average of the output of two separate CF and content-based filter-
as item-tag occurrences, and then linearly combined the output of          ing components. The CF component received a stronger weight
the two models. They tested their approach on a large crawl of             as the data sets grows denser, gradually phasing out the influence
Delicious, and found that it significantly outperforms a popularity-       of the content-based component. They did not find any significant
based algorithm.                                                           differences between the performance of the separate components
                                                                           or the combined version. Baudisch [4] proposed an innovative ap-
5.2      Exploiting Metadata for Recommendation                            proach to incorporating metadata into CF algorithms by joining the
     While a significant amount of research has focused on Collabo-        metadata descriptions to the user-item matrix as additional users.
6.    CONCLUSIONS                                                             Filters in an Online Newspaper. In Proceedings of ACM SI-
   In this paper we have presented a range of collaborative and               GIR Workshop on Recommender Systems, August 1999.
content-based approaches to item recommendation on social book-           [9] M. Clements, A. P. de Vries, and M. J. Reinders. Optimizing
marking websites. Our algorithms were evaluated on four realistic             Single Term Queries using a Personalized Markov Random
data sets of different domains, and compared to two external, state-          Walk over the Social Graph. In Proceedings of ESAIR ’08,
of-the-art approaches. Let us step back now and take stock of our             2008.
findings. Tags represent an additional layer of information in the       [10] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An
folksonomy that binds users and items together. These tags can                Algorithmic Framework for Performing Collaborative Filter-
be used successfully to improve the recommendations of standard               ing. In Proceedings of SIGIR ’99:, pp. 230–237, New York,
nearest-neighbor algorithms, but this depends on the algorithm. For           NY, USA, 1999. ACM.
item-based filtering, using tags for calculating item similarity alle-   [11] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T.
viates sparsity and results in better performance. At the user level,         Riedl. Evaluating Collaborative Filtering Recommender Sys-
however, tags do not offer the same benefits.                                 tems. ACM Transactions on Information Systems, 22(1):5–53,
   Metadata can also be used successfully to generate item recom-             2004.
mendations for social bookmarking websites. While the best ap-           [12] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Infor-
proach seems to be dependent on the data set and the domain, ag-              mation Retrieval in Folksonomies: Search and Ranking. In
gregating all of the intrinsic metadata at the user and item level            Proceedings of ESWC ’06, 2006.
results in algorithms that outperform the algorithms using only in-
                                                                         [13] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme.
formation from the folksonomy.
                                                                              BibSonomy: A Social Bookmark and Publication Sharing
   For future work, we intend to examine the benefits of data fusion.
                                                                              System. In Proceedings of the Conceptual Structures Tool
The tag-aware fusion approach by Tso-Sutter et al. [23] demon-
                                                                              Interoperability Workshop at ICCS 2006, pp. 87–102, 2006.
strates the potential of fusing together the outputs of different rec-
ommendations algorithms and representations.                             [14] K. Lang. NewsWeeder: Learning to Filter Netnews. In Pro-
                                                                              ceedings of ICML ’95, pp. 331–339, San Mateo, CA, USA,
                                                                              1995. Morgan Kaufmann.
Acknowledgments
                                                                         [15] R. J. Mooney and L. Roy. Content-Based Book Recommend-
The work described in this paper was funded by SenterNovem / the              ing Using Learning for Text Categorization. In Proceedings
Dutch Ministry of Economics Affairs as part of the IOP-MMI À                  of DL ’00, pp. 195–204, New York, NY, 2000. ACM Press.
Propos project, and by the Netherlands Organization for Scientific       [16] R. Nakamoto, S. Nakajima, J. Miyazaki, and S. Uemura. Tag-
Research as part of the NWO Vernieuwingsimpuls program.                       Based Contextual Collaborative Filtering. In Proceedings of
                                                                              the 18th IEICE Data Engineering Workshop, 2007.
7.    REFERENCES                                                         [17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageR-
 [1] J. Alspector, A. Koicz, and N. Karunanithi. Feature-based and            ank Citation Ranking: Bringing Order to the Web. Technical
     Clique-based User Models for Movie Selection: A Compar-                  report, Stanford Digital Library Technologies Project, 1998.
     ative Study. User Modeling and User-Adapted Interaction, 7          [18] G. Salton and C. Buckley. Term-Weighting Approaches in
     (4):279–304, 1997.                                                       Automatic Text Retrieval. Information Processing & Man-
 [2] S. Amer-Yahia, A. Galland, J. Stoyanovich, and C. Yu. From               agement, 24(5):513–523, 1988.
     del.icio.us to x.qui.site: Recommendations in Social Tagging        [19] B. Sarwar, G. Karypis, J. Konstan, and J. Reidl. Item-Based
     Sites. In Proceedings of SIGMOD ’08, pp. 1323–1326, New                  Collaborative Filtering Recommendation Algorithms. In Pro-
     York, NY, USA, 2008. ACM.                                                ceedings of WWW ’01, pp. 285–295, New York, NY, USA,
 [3] C. Basu, H. Hirsh, and W. W. Cohen. Recommendation as                    2001. ACM.
     Classification: Using Social and Content-Based Information          [20] T. Strohman, D. Metzler, and W. B. Croft. Indri: A Language
     in Recommendation. In Proceedings of the Fifteenth National              Model-based Search Engine for Complex Queries. In Pro-
     Conference on Artificial Intelligence, pp. 714–720, 1998.                ceedings of ICIA ’05, May 2005.
 [4] P. Baudisch. Joining Collaborative and Content-based Filter-        [21] P. Symeonidis, M. Ruxanda, A. Nanopoulos, and
     ing. In Proceedings of the ACM CHI Workshop on Interacting               Y. Manolopoulos.        Ternary Semantic Analysis of So-
     with Recommender Systems. ACM Press, May 1999.                           cial Tags for Personalized Music Recommendation. In
 [5] T. Bogers and A. Van den Bosch. Using Language Modeling                  Proceedings of ISMIR ’08, pp. 219–224, 2008.
     for Spam Detection in Social Reference Manager Websites.            [22] M. Szomszor, C. Cattuto, H. Alani, K. O’Hara, A. Baldas-
     In R. Aly, C. Hauff, I. den Hamer, D. Hiemstra, T. Huibers,              sarri, V. Loreto, and V. D. Servedio. Folksonomies, the Se-
     and F. de Jong, editors, Proceedings of the 9th Belgian-Dutch            mantic Web, and Movie Recommendation. In Proceedings of
     Information Retrieval Workshop (DIR 2009), pp. 87–94, En-                the ESWC Workshop on Bridging the Gap between Semantic
     schede, February 2009.                                                   Web and Web 2.0, 2007.
 [6] J. S. Breese, D. Heckerman, and C. Kadie. Empirical Anal-           [23] K. H. L. Tso-Sutter, L. B. Marinho, and L. Schmidt-Thieme.
     ysis of Predictive Algorithms for Collaborative Filtering. In            Tag-aware Recommender Systems by Fusion of Collabora-
     Proceedings of the Fourteenth Annual Conference on Uncer-                tive Filtering Algorithms. In Proceedings of SAC ’08, pp.
     tainty in Artificial Intelligence, pp. 43–52, 1998.                      1995–1999, New York, NY, 2008. ACM.
 [7] R. Burke. Hybrid Recommender Systems: Survey and Exper-             [24] R. Wetzker, W. Umbrath, and A. Said. A Hybrid Approach
     iments. User Modeling and User-Adapted Interaction, 12(4):               to Item Recommendation in Folksonomies. In Proceedings of
     331–370, 2002.                                                           ESAIR ’09, pages 25–29, New York, NY, USA, 2009. ACM.
 [8] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes,
     and M. Sartin. Combining Content-Based and Collaborative