Collaborative and Content-based Filtering for Item Recommendation on Social Bookmarking Websites Toine Bogers Antal van den Bosch ILK / Tilburg centre for Creative Computing ILK / Tilburg centre for Creative Computing Tilburg University Tilburg University P.O. Box 90153, 5000 LE P.O. Box 90153, 5000 LE Tilburg, The Netherlands Tilburg, The Netherlands A.M.Bogers@uvt.nl Antal.vdnBosch@uvt.nl ABSTRACT both the user and the items: past user preferences, purchase his- Social bookmarking websites allow users to store, organize, and tory, demographic information, item popularity, the metadata char- search bookmarks of web pages. Users of these services can an- acteristics of the products, etc. Social bookmarking websites, with notate their bookmarks by using informal tags and other metadata, their emphasis on open collaborative information access, offer an such as titles, descriptions, etc. In this paper, we focus on the task ideal scenario for the application of recommender systems technol- of item recommendation for social bookmarking websites, i.e. pre- ogy. They allow users to manage their favorite bookmarks online dicting which unseen bookmarks a user might like based on his or through a web interface and, in many cases, allow their users to her profile. We examine how we can incorporate the tags and other tag the content they added to the system with keywords. The ag- metadata into a nearest-neighbor collaborative filtering (CF) algo- gregation of these tags, the folksonomy, is an extra annotation layer rithm, by replacing the traditional usage-based similarity metrics connecting users and items. The underlying application then makes by tag overlap, and by fusing tag-based similarity with usage-based all information sharable among users. The success of such websites similarity. In addition, we perform experiments with content-based partly depends on how the connections between users, items, and filtering by using the metadata content to recommend interesting tags are exploited. items. We generate recommendations directly based on Kullback- Social bookmarking websites also offer possibilities for attach- Leibler divergence of the metadata language models, and we ex- ing item-specific metadata to each item, such as the item’s title or plore the use of this metadata in calculating user and item simi- summary. This additional metadata could also be used to support larities. We perform our experiments on three data sets from two the recommendation process. Using item metadata to boost perfor- different domains: Delicious, CiteULike and BibSonomy. mance is not new in recommender systems research, but typically such content-related information is attached to the items, and is therefore the same across all users [23]. On the other hand, the tags Categories and Subject Descriptors assigned to items are specific to the user who added them. We know H.3 [Information Storage and Retrieval]: H.3.1 Content Anal- of no approaches to item recommendation for social bookmarking ysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 that investigate the use of such metadata. Systems and Software In this paper we focus on the question how we can make use the information represented by the folksonomy and the item meta- General Terms data to boost the performance of traditional collaborative filtering algorithms. We make the following contributions. First, we exam- Algorithms, Measurement, Performance, Experimentation ine different ways of extending the standard nearest-neighbor algo- rithm with information about tagging behavior. Second, we explore Keywords how best to use the metadata for recommendation by proposing Recommender systems, social bookmarking, folksonomies, collab- four different algorithms. Finally, we perform our experiments on orative filtering, content-based filtering publicly available data sets with standardized evaluation metrics to promote verifiability. The remainder of this paper is structured as 1. INTRODUCTION follows. We start by introducing our data sets and experimental setup in the next section. In Section 3 we describe and compare Recommender systems belong to a class of personalized infor- different CF algorithms that operate on the folksonomy to gener- mation filtering technologies that aim to identify which items in ate recommendations. Section 4 we describe how we exploit the a catalog might be of interest to a particular user. Recommenda- metadata in our data sets to generate item recommendations. We tions can be made using a variety of information sources related to describe the related work in Section 5 and conclude in Section 6. Permission to make digital or hard copies of all or part of this work for 2. METHODOLOGY personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to 2.1 Data Sets republish, to post on servers or to redistribute to lists, requires prior specific We base our experiments on four data sets that were collected permission and/or a fee. from three different social bookmarking websites with different char- ACM RecSys ’09 Workshop on Recommender Systems and the Social Web October 25, 2009, New York, NY, USA. acteristics: CiteULike, BibSonomy, and Delicious. Two data sets Copyright 2009 by the author(s). correspond to the domain of Web page bookmarks (Delicious and BibSonomy) and the other two cover the domain of scientific arti- 2.2 Experimental Setup & Evaluation cles (Delicious and BibSonomy). In order to evaluate and compare different recommender algo- CiteULike is a social bookmarking service that allows its users to rithms, we need a proper framework for experimentation and evalu- add their academic reference library to an online profile1 . Articles ation. Recommender systems evaluation—and the differences with can be stored with their metadata, abstracts, and links to the papers IR evaluation—has been addressed by, among others, [11]. We at the publishers’ sites. Users can also add personal comments and evaluate the “Find Good Items” task, also known as Top-N rec- tags. CiteULike offers daily dumps of their core database. We used ommendation, where users are provided with a ranked list of rec- the dump of November 2, 2007 as the basis for our experiments. ommended items based on their personal profile. We divide each A dump contains all information on which articles were posted by data set into a training and test set by randomly selecting 10% of whom, with which tags, and at what point in time. It does not, how- the users to be in our test set. Final performance is evaluated on ever, contain any other item metadata, so we crawled this ourselves this 10% by withholding 10 items from each of these so-called ac- from the CiteULike website using the article IDs. Articles are an- tive users, and using the remaining profile items together with the notated using the standard BibTeX-like fields, such as title, author training set to generate the recommendations for those 10%. If the names, page numbers, publisher information, etc. withheld items are predicted at the top of the ranked result list, then BibSonomy is a social bookmarking service for sharing Web the algorithm is considered to do perform well. To prevent over- page bookmarks and reference lists of scientific articles2 . Items estimation when optimizing algorithm parameters, we use 10-fold are stored and represented by their BibTeX metadata representa- cross-validation. We subdivide our training set into 10 folds and tions. These can include abstracts and links to the papers at the pub- use these for 10-fold cross-validation of our parameter optimiza- lishers’ websites. Users are able to tag their bookmarked content tion. For each fold, 10 items are withheld from the test fold users and use these tags to browse and discover related references [13]. to be retrieved by the recommendation algorithm. The final values BibSonomy’s creators organized the 2008 ECML/PKDD Discov- for our evaluation metric on the withheld items are then macro- ery Challenge which focused on social bookmarking, and released averaged over the 10 folds. the BibSonomy data set to the public in May 2008 as part of this In our evaluation, we adopt an IR perspective by treating each of challenge3 . The organizers made a snapshot of the BibSonomy the users as a separate query or topic. The 10 withheld items for system consisting of all resources posted to BibSonomy between each user constitute the items for which we have relevance judg- its inception in 2006 and March 31, 2008. It includes the same ments. Herlocker et al. [11] assess the usefulness of different met- type of article metadata as we collected for CiteULike. The dis- rics for different types of recommendation tasks. For the Top-N tinction between bookmarks and BibTeX records is also made in recommendation task, they find that metrics that take into account this snapshot. We therefore split this data dump into a data set con- the ranking of the items are most appropriate. We therefore evalu- taining only web bookmarks (Bibsonomy Bookmarks), and a data ate our algorithms using Mean Average Precision (MAP), which is set containing only scientific articles (Bibsonomy Articles). defined as the average of the Average Precision values calculated Delicious is a social bookmarking service for storing, sharing, over each relevant retrieved item. For determining significance of and discovering web bookmarks. It allows its users to manage differences between runs, we use a two-tailed paired Student’s t- and tag URLs of web pages4 . Unlike CiteULike and BibSonomy, test. We report on significant differences against the best baseline Delicious does not offer data dumps of their databases, so we gath- runs using M (and O ) for α = .05 and N (and H ) for α = .01. ered our data set by crawling a subset of the Delicious website. Be- cause of our focus on the task of item recommendation for users, our aim was to collect a balanced, unbiased set of user profiles, 3. FOLKSONOMIC RECOMMENDATION i.e. the complete set of bookmarks a user had posted to Delicious. We start by establishing some notation and definitions of the task From an earlier breadth-first crawl of Delicious we obtained a list at hand, based in part on notation by [9]. In the social bookmarking of 300,000 users. We randomly selected around 18,000 of these setting, users post items to their personal profiles and can choose users to match the size of our CiteULike data set, and crawled their to label them with one or more tags. We define a folksonomy to be entire profiles. the tripartite graph that emerges from this collaborative annotation of items. The resulting ternary relations that make up the tripar- 2.1.1 Data set filtering tite graph can be represented as a 3D matrix of users, items, and It is common practice in recommender system evaluation to se- tags. Figure 1 illustrates this view. We refer to the 3D matrix as lect realistic subsets of the data sets used to ensure that reliable D(uk , il , tm ). Here, each element d(k, l, m) of this matrix indicates recommendations can be generated. This also allows for a fair if user uk (with k = {1, . . . , K}) tagged item il (with l = {1, . . . , L}) comparisons of different recommendation algorithms [11]. This with tag tm (with m = {1, . . . , M}), where a value of 1 indicates the is typically done by filtering out users or items whose profile size ternary relation is present in the folksonomy. or popularity falls below a certain threshold. We follow this proce- In conventional recommender systems, the user-item matrix con- dure in our preparation of the data sets as well. We only retain the tains ratings information. These ratings can be explicit, when they users who have added 20 items or more to their personal profile. In are entered directly by the user, or implicit, when they are inferred addition, we filter out all items that occur only once, as well as all from user behavior. In our case we have implicit, unary ratings untagged posts. We were able to identify and filter out most of the where all items that were added by a user receive a rating of 1. spam content in the CiteULike and BibSonomy data sets. We refer We extract this ratings matrix R(uk , il ) for all user-item pairs di- the reader to [5] for more details about this process. Table 1 lists rectly from the tripartite graph. We denote its individual elements the statistics of our four data sets after filtering. by xk,l = {1, ∅}. Each user is represented in this matrix as its user profile row vector uk , which lists the items that user added to his or 1 http://www.citeulike.org/ her profile. Items are represented by the column vectors of R which 2 represent the item profile vectors il that contain all users that have http://www.bibsonomy.org/ 3 added that item. As shown in Figure 1, we can also extract a user- http://www.kde.cs.uni-kassel.de/ws/rsdc08/ 4 http://www.delicious.com/ item matrix from D by aggregating over the tag dimension. We then Table 1: Statistics of the filtered versions of our four data sets. bookmarks articles Delicious BibSonomy CiteULike BibSonomy # users 1,243 192 1,322 167 # items 152,698 11,165 38,419 12,982 # tags 42,820 13,233 28,312 5,165 # posts 238,070 29,096 84,637 29,720 user-item sparsity 99.8746 98.6427 99.8334 98.6291 avg # items per user 191.5 151.5 64.0 178.0 avg # users per item 1.6 2.6 2.2 2.3 avg # tags per user 192.1 203.3 57.3 79.2 avg # users per tag 5.6 2.9 2.7 2.6 avg # tags per item 4.8 8.4 5.3 3.1 avg # items per tag 17.0 7.1 7.3 7.7 items 3.1 Baseline Recommendation Algorithms A common and well-understood source of information for rec- ommendation is usage patterns of adding and rating items. The users D class of algorithms that exploit such patterns for recommendation purposes are called Collaborative Filtering algorithms (CF). In this tags paper we focus on using and extending the k-Nearest Neighbor (k- NN) algorithm. We pick the k-NN algorithm because it is a well Σ understood algorithm that can intuitively be extended to include other information in addition to transaction patterns [7, 10]. There i1 items iL i1 items iL are two flavors of the k-NN algorithm for CF: user-based filtering u1 u1 and item-based filtering. In user-based filtering, we locate the users most similar to the active users, and look among their items for new binarize recommendations. In item-based filtering, we locate the most simi- users R users UI lar items for each of the active user’s items and aggregate these into uK uK a list of predicted items. User-based Filtering. Figure 1: Representing the folksonomy graph as a 3D matrix. In the first step of user-based filtering we calculate the similari- The ratings matrix R is derived from the tripartite graph it- ties between pairs of users to identify the most similar users for an self, and directly represents what items were added by which active user. Many different similarity metrics have been proposed users. Aggregation over the tag dimension of D gives us matrix and evaluated over time, such as Pearson’s correlation coefficient UI, containing the tag counts for each user-item pair. We can and cosine similarity [6]. We use the cosine similarity in our ex- obtain R by binarizing the values in UI.. periments as it has often been used successfully on data sets with implicit ratings [6, 19]. We calculate the cosine similarity between the active user uk and another user ua on the user profile vectors uk obtain the K × L user-item matrix UI(uk , il ) = m=1 D(uk , il , tm ), and ua as simcosine (uk , ua ) = ||uukk||·u||uaa || . PM specifying how many tags each user assigned to each item. Be- The next step in user-based filtering is to determine the top N cause we filtered our data sets to include only tagged content, our similar users (or items) for user uk . We denote this set as the ratings matrix R is the same as a binary version of UI. Similar to Set of Similar Users SSU(uk ), which are the top N users of the the way we defined UI we can also aggregate the content of D over set of all users ua , ranked by their cosine similarity. For each user ua , we only consider those items that ua added to his pro- PLdimensions. We define the K × M user-tag ma- the user and the item trix UT(uk , tm ) = l=1 D(uk , il , tm ), specifying how often each user file (xa,l , ∅). Using this set of nearest neighbors we generate used certain tag to annotate his or her items. Individual elements the P final prediction scores b xk,l = xk,l for each unseen item il as b of UT are denoted by yk,m . We define the L × M item-tag matrix ua ∈SSU(uk ) simcosine (uk , ua ). Here, the predicted score of an item IT(il , tm ) = k=1 D(uk , il , tm ), indicating how many users assigned PK il is the sum of the similarity values (between 0 and 1) of all N a certain tag to an item. Individual elements of IT are denoted by nearest neighbors that actually added item il (i.e. xa,l , ∅). zl,m . We also define binary versions of UT and IT as UTbinary and A recurring observation from the literature about CF algorithms ITbinary . The row vectors of the UT and IT matrices represent the is that universally liked items are not as useful for capturing the user tag profiles dk and item tag profiles fl respectively. They list similarity between users as less common items, see e.g. [6]. We what tags have been assigned by a user to his items, or to an item therefore perform two runs: the ‘vanilla’ base run described above by its users. Formally, the goal of each of the recommendation al- (u-bin-sim) and a run where the values in the user vectors are weighted gorithms discussed in this paper is to produce a ranking of all items by the inverse user frequencies of the items (u-bin-idf-sim). l that are not yet in the profile of the active user uk (i.e., xk,l = ∅). To this end, we predict a score b xk,l for each item that represents the Item-based Filtering. likelihood of that item being relevant for the active user. The final The item-based k-NN algorithm operates analogously to the user- recommendations for a user are generated by ranking all items il by based filtering algorithm [19]. Instead of comparing users directly, their predicted score bxk,l . Table 2: Results of the folksonomic recommendation runs. Reported are the MAP scores as well as the optimal number of neighbors N. Best-performing runs for each data group of approaches are printed in bold. Best-performing tag overlap runs for both user- based and item-based are printed in bold. The percentage difference between the best baseline CF runs and the best tag overlap runs are indicated after each type. bookmarks articles Runs BibSonomy Delicious BibSonomy CiteULike MAP N MAP N MAP N MAP N Best baseline UB CF run 0.0277H 13 0.0046H 15 0.0865H 4 0.0757H 15 (u-bin-idf-sim) (u-bin-sim) (u-bin-sim) (u-bin-idf-sim) ut-jaccard-sim 0.0070H 8 0.0015H 11 0.0459H 6 0.0449H 5 ut-dice-sim 0.0069O 6 0.0007H 6 0.0333O 4 0.0439H 2 ut-bin-sim 0.0102H 5 0.0017H 11 0.0332 O 4 0.0452 H 3 O O H H ut-tf-sim 0.0069 2 0.0015 25 0.0368 4 0.0428 8 ut-tfidf-sim 0.0018O 6 0.0013O 17 0.0169H 2 0.0400H 2 % Change over best UB CF run -63.2% -63.0% -46.9% -40.7% Best baseline IB CF run 0.0244H 34 0.0027H 25 0.0737H 49 0.0887H 30 (i-bin-idf-sim) (i-bin-sim) (i-bin-idf-sim) (i-bin-idf-sim) it-jaccard-sim 0.0370H 3 0.0083M 21 0.0909H 6 0.0810H 14 it-dice-sim 0.0317H 2 0.0089M 25 0.0963H 8 0.0814H 8 it-bin-sim 0.0334H 2 0.0101M 23 0.0868H 5 0.0779H 10 it-tf-sim 0.0324H 4 0.0100M 11 0.0823 H 4 0.0607 H 17 it-tfidf-sim 0.0287H 8 0.0058H 7 0.1100H 7 0.0789H 21 % Change over best IB CF run +51.6% +274.1% +49.3% -8.2% % Change over best CF run +33.6% +119.6% +27.2% -8.2% we try to identify the best recommendations for each of the items rics, which means we calculate them on the binary vectors from the in an active user’s profile. In other words, for item-based filter- UTbinary matrix. The Jaccard Overlap sim Jaccard (dk , da ) between two ing we calculate the similarities between the test items of the ac- users dk and da is defined as |d k ∩da | |dk ∪da | . Dice’s coefficient simDice (dk , da ) tive user uk and the other items that user has not yet added (so k ∩da | is defined as 2|d . We refer to the user-based runs with the Jac- xk,l = ∅). Similarity between two items il and ib is calculated on the |dk |+|da | card overlap and Dice’s coefficient as ut-jaccard-sim and ut-dice- item profile vectors il and ib as simcosine (il , ib ) = ||ili||l ·i||ibb || . Next, we sim respectively. The cosine similarity is calculated in three dif- identify the top N most similar items for each of the active user’s ferent ways. First, we calculate it on the regular tag count vectors items il separately. We define this neighborhood as the Set of Sim- dk and da from UT as ut-tf-sim, and on the binary vectors from the ilar Items SSI(il ), where we select the top N of all items not al- UTbinary matrix as ut-bin-sim. In addition, we also experiment with ready in the active user’s profile, ranked on their cosine similarity idf-weighting of the tags in the user tag count vectors as we did be- simcosine (il , ib ) to item il . Using this set of nearest neighbors we fore. We refer to this run as ut-tfidf-sim. The item-based versions generate P the final prediction score b xk,l for each unseen item il as of these similarity metrics are calculated on the IT and ITbinary ma- xk,l = ib ∈SSI(il ) simcosine (il , ib ). Here, the predicted score is the sum b trices. We refer to these five item-based runs as it-jaccard-sim, of the similarity values (between 0 and 1) of all the most similar it-dice-sim, it-bin-sim, it-tf-sim, and it-tfidf-sim. items that were added by user uk (i.e. xk,b , ∅). Analogous to user-based filtering, we can also suppress the influence of the most ‘popular’ users, i.e. users that have added a disproportionately large 3.3 Results & Discussion number of items to their profile, such as bots or spam users. We re- Table 2 compares the results of our baseline CF runs that em- fer to the item-based filtering runs weighted with the inverse item ploy usage-based similarities to the runs that use overlap in tag- frequency as i-bin-idf-sim and to the unweighted runs as i-bin-sim. ging behavior as a source of user and item similarity. We see that the user-based filtering baseline outperforms item-based filtering on three of four data sets; only on CiteULike does item-based filter- 3.2 Tag Overlap Similarity ing work better, where this difference is also statistically significant The folksonomies present in our four data sets each constitute an (p < 0.05). The other differences between user-based and item- extra layer of connections between user and items. We exploit this based filtering are not significant. There appears to be no clear or extra layer for determining another type of similarity between users statistically significant advantage to applying idf-weighting to the or items. For instance, users that assign many of the same tags can profile vectors. An explanation for the advantage of user-based fil- be seen as similar, and items that are often assigned the same tags tering is that, according to Table 1, the average number of items per can also be seen as similar. user is much higher than the average number of users per item. Cal- We restrict ourselves to comparing three common similarity met- culating a meaningful overlap between user profile vectors could rics: Jaccard overlap, Dice’s coefficient, and the cosine similarity. therefore be more robust than between item profile vectors. We use these similarities as the basis for item recommendation. As for the results with tag overlap, we observe that item simi- The only difference between this approach and the standard CF al- larities based on tag overlap work well for item-based filtering, as gorithm is in the first step, where the similarities are calculated. three of our four data sets show considerable improvements over For user-based filtering, we calculate tag overlap on the UT ma- the best CF baseline runs. Performance increases range from 27% trix or on the binarized version UTbinary , depending on the metric. on Bibsonomy Articles to almost 120% on Delicious, but these Both the Jaccard overlap and Dice’s coefficient are set-based met- are only statistically significant on the Delicious data set. We see the opposite trend for user-based filtering, where tag overlap re- sic and extrinsic fields, resulting in a total of 34 runs per algorithm. sults in significantly worse scores for almost all variants on all data We did not test the extrinsic fields separately. Due to space restric- sets, with performance decreases ranging from 40% to 63%. This tions we only report the results of the best runs for each algorithm. means that using tag overlap in item-based filtering makes item- based filtering outperform user-based filtering on all four data sets. 4.1 Content-based Filtering We believe that it is the reduction in sparsity from using tag overlap The first approach we propose is content-based filtering where that causes this difference in performance. On average, the number the focus is on properly representing the content in our social book- of tags assigned to an item is 2.5 times higher than the number of marking data sets. Based on these representations our aim is to users who have added the item. This means that, on average, item construct an interest profile of an active user, and then use this profile vectors from the IT matrix are less sparse than item pro- profile to rank-order the unseen items by similarity to the profile, file vectors from the UI matrix, making the possibility of overlap thereby approximating possible interest in those items. Figure 2 between vectors more likely. Using more values in the similarity illustrates two different algorithms we propose for content-based calculation leads to a better estimate of the real similarity between filtering: profile-centric matching and post-centric matching. two items. The difference between our two content-based filtering algorithms For user-based filtering this difference is not as well-pronounced: is the level of aggregation. In our profile-centric matching ap- in some data sets users have more items than tags on average, and proach, we collate all of a user’s assigned metadata into a single more tags than items in other data sets. This explains why we do user profile. The intuition here is that by aggregating all of the not see the same performance increase for the user-based filtering metadata assigned by a user we can completely capture his or her runs based on tag overlap. The results of the different tag overlap interests. Similarly, we construct item profiles that collate all of the metrics tend to be close together and differences between them are metadata assigned to those items by all users in the training set. We not statistically significant. Even though the best-performing simi- then match the active user profiles against the item profiles on sim- larity metrics are dependent on the data set, we do see that the met- ilarity to produce a ranking of all items, as illustrated in the top half rics operating on the binary vectors from the UTbinary and ITbinary of Figure 2. After removing the items already in the active user’s matrices are consistently among the top performers. profile, we are left with the final rank-ordered list of recommenda- In general, it appears that bookmark recommendation is more tions. difficult than article recommendation. We believe this is due to a In contrast, post-centric matching operates on the level of indi- difference in topic specificity. The Delicious and Bibsonomy Book- vidual posts. We match each of an active user’s posts separately marks data sets cover bookmarks of web pages, which encompass against all the other posts of unseen items in the training set, as many more topics than scientific articles do. Users of Delicious illustrated in the bottom half of Figure 2. This leads to a list of and Bibsonomy Bookmarks can be expected to have more differ- matching posts in order of similarity for each of the active user’s ent topics in their profile, making it more difficult to recommend posts. Since retrieval scores are not directly comparable between new, interesting bookmarks based on their profiles. We see evi- runs, we normalize the original similarity scores simorg into [0, dence for this explanation in the average number of unique tags per 1] using the maximum and minimum similarity scores simmax and simorg −simmin user: 203.3 and 192.1 for Bibsonomy Bookmarks and Delicious simmin according to simnorm = simmax −simmin . We then calculate a respectively, which is markedly higher than the 79.2 and 57.3 for rank-corrected sum of similarity scores for each item il according to score(i) = log(rank(i P simnorm (il ) Bibsonomy Articles and CiteULike. . The final list of recommendations ranks l ))+1 every unseen item il by their rank-corrected score score(il ). 4. RECOMMENDATION USING METADATA In addition to the folksonomic structure of the underlying net- 1 (a) profile‐centric matching A work, social bookmarking services also offer users the possibility Ac:ve user profiles Training item profiles A 2 to annotate the content of their items with metadata. In this section we investigate the role such metadata can play in recommending 3 3 A C A interesting bookmarks or references. We propose two different ap- similarity B 2 D 1 2 4 C proaches: content-based filtering and hybrid filtering. Before we matching move on to describing these in Sections 4.1 and 4.2, we first take a B 5 5 B C closer look at the metadata we have available. training C 1 pairs In our approach we distinguish between item-intrinsic and item- (b) post‐centric matching C 3 extrinsic metadata. Item-intrinsic metadata fields relate directly to Ac:ve user's posts Training posts the content of the item being annotated. For the two data sets deal- C 4 ing with web bookmarks these include DESCRIPTION, TAGS, TITLE, 1 D 2 A C 5 and URL. The two scientific article data sets contain the additional 1 1 D 3 A intrinsic fields ABSTRACT ,AUTHOR, BOOKTITLE, EDITOR, JOURNAL, NOTE, D similarity and SERIES. The intuition behind assigning metadata fields to the 2 1 D matching 2 B D item-intrinsic category is that these fields can be used as stand- 3 ... ... alone sources for recommending other content. For instance, given D test a certain paper from a user’s profile, papers with similar abstracts, 4 pairs 2 2 D D A papers written by the same author, or papers published at the same workshop are likely to be relevant recommendations. In contrast, item-extrinsic metadata fields—such as MONTH or PAGES—cannot be Figure 2: Visualization of our two content-based filtering ap- used to directly generate appropriate recommendations. We per- proaches to item recommendation for a small toy data set. formed experimental runs using the metadata of each of our intrin- sic fields separately. In addition, we experimented with the combi- nation of all intrinsic fields, and with runs that combined all intrin- In both content-based filtering algorithms, we approach the rec- earlier in Section 3.1. As in CF, with our hybrid filtering algorithms ommendation process from an IR perspective and restrict ourselves we also distinguish between user-based filtering, where we gener- to measuring textual similarity. We use the open-source retrieval ate recommendations by determining the most similar users, and toolkit Lemur to calculate the similarities between the different item-based filtering, where we recommend the items most similar user and item profiles. The Lemur toolkit5 implements different to the items in the active user’s profile. Like in Section 4.1, we ap- retrieval methods based on language modeling [20]. Preliminary proach the first step from an IR perspective and calculate the textual experiments comparing language modeling with the OKAPI model similarities between users or items. For each user and each item we and a tf·idf approach suggested a language modeling approach with generate user and item profile representations, constructed as fol- Jelinek-Mercer smoothing as the best-performing retrieval method. lows. All of the metadata text of a user’s posts is collated into a The language models we used are maximum likelihood estimates single “user profile” for that user. Similarly, for the item-based ap- of the unigram occurrence probabilities. We filter stopwords using proach we create item profiles for each item by concatenating all of the SMART stopword list and do not perform stemming. the metadata assigned to that item by all the users who have the item in their profile. This means that items are represented by their ag- 4.2 Hybrid Filtering gregated community metadata and not just by a single user’s data. In addition to focusing solely on using the metadata for recom- Again, we used the open-source retrieval toolkit Lemur to calcu- mendation, we also consider a hybrid approach that joins content- late the similarities between the different metadata representations, based filtering and CF, in the hope of combining the best of both with the same experimental settings as described in Section 4.1. worlds. Many different combination methods have been proposed in earlier work [7]. In our hybrid filtering approach we view meta- data in social bookmarking systems as another source of informa- 4.3 Results & Discussion tion for locating the nearest neighbors of users and items in CF Table 3 contains the best runs for each of the four metadata-based algorithms. Figure 3 illustrates this approach. Instead of only look- algorithms, as well as our best CF run from Section 3. What we see, ing at the overlap in items that two users have in common when is that on three out of four data sets a recommendation algorithm calculating user similarities, we can use the overlap in the metadata that uses metadata is better than the best CF run using data from the applied to items to determine the most similar neighbors. Users that folksonomy. All of our best metadata runs use the combined meta- describe their profile items using the same terminology are likely data fields. On their own, each field can be seen as an imperfect share the same interests, making them a good source of recom- representation of the items and users, but combined they alleviate mendations. This is similar to the way we used the tag clouds of each others weak points and better represent the content than they users and items to calculate similarity between users and items in do separately. Only on the Delicious data set do all metadata-based the previous section. The user and item similarities we derive in approaches perform significantly worse than the CF runs. Unfortu- this way are then plugged into the standard memory-based CF al- nately, we do not have an explanation for this. When we compare gorithms as described in Section 3.1. The resulting algorithm is a the metadata-based approaches with each other, we see that most feature-augmented hybrid of CF and content-based filtering. differences are not statistically significant. On the Bibsonomy Ar- ticles data set, the item-centric hybrid filtering approach is signif- 1 icantly better than the user-centric approach (p < 0.05). On the A (a) user‐based filtering CiteULike data set, the profile-centric approach also significantly A 2 Ac:ve user profiles Training user profiles outperforms the post-centric and user-centric approaches. A 3 1 2 3 In general, we observe that the profile-centric approach tends A to outperform the post-centric approach on three of our four data B 2 similarity D 1 2 matching B 2 5 sets. This improvement is statistically significant for the CiteULike B 5 data set with an improvement of 117% (p < 10−6 ). Only on the training C 1 3 4 5 Delicious data set does post-centric matching perform significantly C 1 pairs better (p < 0.05). This advantage of the profile-centric approach C 3 is strongest on the article data sets where the profile-centric ap- C 4 (b) item‐based filtering proach performs best for 75% of the all runs with different fields. Ac:ve user's posts Training posts In the case of hybrid filtering, the item-centric approach outper- C 5 forms the user-centric approach on three of our four data sets. On D 1 3 the CiteULike and Bibsonomy Articles data sets these differences A C 2 1 A C D similarity are statistically significant and especially large at 268% (p < 0.05) D 4 matching C and 112% respectively (p < 0.01). 3 2 A B D D test While we do not have room to report the results of all individ- 5 B C 4 pairs ual intrinsic field runs, we can report on our general findings. For D all four approaches, the best-performing single fields are AUTHOR, DESCRIPTION, TAGS, and TITLE, which provide the best individual Figure 3: Visualization of our two hybrid filtering approaches results on all four data sets for all approaches. This is not surpris- to item recommendation for a small toy data set. ing, as these fields are the least sparsely filled of all the intrinsic fields. In addition, these four fields are also aimed directly at de- Hybrid filtering also consists of two steps: (1) calculating the scribing the content of the items, more so than the conference or most similar neighbors of the active user or his items, and (2) us- journal titles or the editors. Another interesting observation is that ing those neighbors to predict item ratings for the active user. The the TITLE field served as a better source of user and item similar- latter prediction step is performed in the same manner as described ity on the article data sets than on the bookmark data sets. This is because titles assigned to bookmarks are more variable than titles 5 Available at http://www.lemurproject.org assigned to scientific articles, leading to this performance gap. Table 3: Results comparison of the best metadata-based runs with our best folksonomic CF runs. Reported are the MAP scores as well as the optimal number of neighbors N where applicable. The best-performing runs are printed in bold. The percentage difference between our best meta-data approaches and the best CF runs is listed in the bottom row. bookmarks articles Runs BibSonomy Delicious BibSonomy CiteULike MAP N MAP N MAP N MAP N 0.0370H 3 0.0101H 23 0.1100H 7 0.0887H 30 Best CF run (it-jaccard-sim) (it-bin-sim) (it-tfidf-sim) (i-bin-idf-sim) Profile-centric filtering 0.0402H - 0.0014H - 0.1279H - 0.0987H - (all intrinsic) (TITLE) (all intrinsic) (all extrinsic) H O H H Post-centric filtering 0.0259 - 0.0036 - 0.1190 - 0.0455 - (all intrinsic) (TAGS) (all intrinsic) (all extrinsic) User-centric hybrid filtering 0.0218H 2 0.0039O 13 0.0410H 2 0.0608H 2 (URL) (all intrinsic) (TITLE) (TITLE) Item-centric hybrid filtering 0.0399H 11 0.0017H 8 0.1510H 21 0.0746H 21 (TAGS) (all intrinsic) (all intrinsic) (TAGS) % Change over best CF run +8.6% -61.3% +37.2% +11.3% 5. RELATED WORK rative Filtering for recommending interesting items, there has also been considerable work on content-based filtering, which can be 5.1 Folksonomic Recommendation seen as an extension of the work done on information filtering. One of the first approaches to recommendation for social book- Content-based filtering has been applied to many different domains. marking websites was presented by [12], who proposed a graph- Early work on content-based filtering included the NewsWeeder based algorithm called FolkRank. They generated 2D projections system by Lang et al. [14], which used the words contained in of the tripartite graph and proposed a random walk model similar newsgroup messages as its features. Alspector et al. [1] compared to PageRank [17] that uses the steady state node probabilities as a CF approach to movie recommendation with content-based filter- the basis for ranking their recommendations. Clements et al. [9] ing. For their content-based component they built metadata repre- also proposed a random walk model for item recommendation, but sentations of all movies using fields such as directory, genre, and combine ratings information with tagging information into a single awards, and used linear regression and classification and regression model. They also incorporated self-transition probabilities in the trees to learn user profiles and rank-order the items for those users. matrix, and used the walk length as an algorithm parameter. They found that CF performed significantly better than the content- There have also been several adaptations of memory-based algo- based methods, but noted that this was likely due to the poor feature rithms that include information about the tags assigned by users to set they used. Mooney et al. [15] describe Libra, a content-based items. Approaches that resemble our use of tag overlap for calculat- book recommender system. They crawled the book metadata from ing similarities between users and items include [2], [16], and [22]. the Amazon website and represented each book as a bag-of-words Tso-Sutter et al. [23] proposed a novel tag-aware k-NN algorithm vector. They then used a Naive Bayes classifier to learn user pro- for item recommendation. When calculating the user and item sim- files and to rank-order unseen books for the user. ilarities they include the tags as additional items and users respec- We are not the first to suggest the combination of CF with content- tively. They then calculate cosine similarity on these extended pro- based filtering, as the advantages of both approaches are largely file vectors and fuse together the predictions of the user-based and complementary. CF is the more mature of the two approaches and item-based filtering runs. This fused model is able to effectively works best in a situation with a stable set of items and a dense user capture the relationship between users, items, and tags. base. Content-based filtering methods are better at dealing with Symeonidis et al. [21] were among the first to propose a model- sparse, dynamic domains such as news filtering, and are better at based approach to incorporating tagging information in recommen- recommending for non-average users. Basu et al. [3] were among dation. They propose an item recommendation approach that per- the first to propose a hybrid recommender system that used both forms tensor decomposition on the third-order folksonomy tensor. collaborative and content features to represent the users and items. By performing higher-order SVD, they approximate weights for The collaborative features captured what movies a user likes and each user-item-tag triple in the data set, which can then be used the content features included metadata fields such as actors, direc- to support item recommendation. They compared their algorithm tors, genre, titles, and tag lines. They used Ripper, a rule-based to the FolkRank algorithm [12], and found that tensor decomposi- machine learning algorithm to predict which items are interesting, tion outperforms the latter. Wetzker et al. [24] took a Probabilistic and found that the combination of collaborative and content-based Latent Semantic Analysis (PLSA) approach, which assumes a la- features produced the best results. Claypool et al. [8] presented a tent lower dimensional topic model. They extended PLSA by es- weighted hybrid recommender system that calculated a weighted timating the topic model from both user-item occurrences as well average of the output of two separate CF and content-based filter- as item-tag occurrences, and then linearly combined the output of ing components. The CF component received a stronger weight the two models. They tested their approach on a large crawl of as the data sets grows denser, gradually phasing out the influence Delicious, and found that it significantly outperforms a popularity- of the content-based component. They did not find any significant based algorithm. differences between the performance of the separate components or the combined version. Baudisch [4] proposed an innovative ap- 5.2 Exploiting Metadata for Recommendation proach to incorporating metadata into CF algorithms by joining the While a significant amount of research has focused on Collabo- metadata descriptions to the user-item matrix as additional users. 6. CONCLUSIONS Filters in an Online Newspaper. In Proceedings of ACM SI- In this paper we have presented a range of collaborative and GIR Workshop on Recommender Systems, August 1999. content-based approaches to item recommendation on social book- [9] M. Clements, A. P. de Vries, and M. J. Reinders. Optimizing marking websites. Our algorithms were evaluated on four realistic Single Term Queries using a Personalized Markov Random data sets of different domains, and compared to two external, state- Walk over the Social Graph. In Proceedings of ESAIR ’08, of-the-art approaches. Let us step back now and take stock of our 2008. findings. Tags represent an additional layer of information in the [10] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An folksonomy that binds users and items together. These tags can Algorithmic Framework for Performing Collaborative Filter- be used successfully to improve the recommendations of standard ing. In Proceedings of SIGIR ’99:, pp. 230–237, New York, nearest-neighbor algorithms, but this depends on the algorithm. For NY, USA, 1999. ACM. item-based filtering, using tags for calculating item similarity alle- [11] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. viates sparsity and results in better performance. At the user level, Riedl. Evaluating Collaborative Filtering Recommender Sys- however, tags do not offer the same benefits. tems. ACM Transactions on Information Systems, 22(1):5–53, Metadata can also be used successfully to generate item recom- 2004. mendations for social bookmarking websites. While the best ap- [12] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Infor- proach seems to be dependent on the data set and the domain, ag- mation Retrieval in Folksonomies: Search and Ranking. In gregating all of the intrinsic metadata at the user and item level Proceedings of ESWC ’06, 2006. results in algorithms that outperform the algorithms using only in- [13] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. formation from the folksonomy. BibSonomy: A Social Bookmark and Publication Sharing For future work, we intend to examine the benefits of data fusion. System. In Proceedings of the Conceptual Structures Tool The tag-aware fusion approach by Tso-Sutter et al. [23] demon- Interoperability Workshop at ICCS 2006, pp. 87–102, 2006. strates the potential of fusing together the outputs of different rec- ommendations algorithms and representations. [14] K. Lang. NewsWeeder: Learning to Filter Netnews. In Pro- ceedings of ICML ’95, pp. 331–339, San Mateo, CA, USA, 1995. Morgan Kaufmann. Acknowledgments [15] R. J. Mooney and L. Roy. Content-Based Book Recommend- The work described in this paper was funded by SenterNovem / the ing Using Learning for Text Categorization. In Proceedings Dutch Ministry of Economics Affairs as part of the IOP-MMI À of DL ’00, pp. 195–204, New York, NY, 2000. ACM Press. Propos project, and by the Netherlands Organization for Scientific [16] R. Nakamoto, S. Nakajima, J. Miyazaki, and S. Uemura. Tag- Research as part of the NWO Vernieuwingsimpuls program. Based Contextual Collaborative Filtering. In Proceedings of the 18th IEICE Data Engineering Workshop, 2007. 7. REFERENCES [17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageR- [1] J. Alspector, A. Koicz, and N. Karunanithi. Feature-based and ank Citation Ranking: Bringing Order to the Web. Technical Clique-based User Models for Movie Selection: A Compar- report, Stanford Digital Library Technologies Project, 1998. ative Study. User Modeling and User-Adapted Interaction, 7 [18] G. Salton and C. Buckley. Term-Weighting Approaches in (4):279–304, 1997. Automatic Text Retrieval. Information Processing & Man- [2] S. Amer-Yahia, A. Galland, J. Stoyanovich, and C. Yu. From agement, 24(5):513–523, 1988. del.icio.us to x.qui.site: Recommendations in Social Tagging [19] B. Sarwar, G. Karypis, J. Konstan, and J. Reidl. Item-Based Sites. In Proceedings of SIGMOD ’08, pp. 1323–1326, New Collaborative Filtering Recommendation Algorithms. In Pro- York, NY, USA, 2008. ACM. ceedings of WWW ’01, pp. 285–295, New York, NY, USA, [3] C. Basu, H. Hirsh, and W. W. Cohen. Recommendation as 2001. ACM. Classification: Using Social and Content-Based Information [20] T. Strohman, D. Metzler, and W. B. Croft. Indri: A Language in Recommendation. In Proceedings of the Fifteenth National Model-based Search Engine for Complex Queries. In Pro- Conference on Artificial Intelligence, pp. 714–720, 1998. ceedings of ICIA ’05, May 2005. [4] P. Baudisch. Joining Collaborative and Content-based Filter- [21] P. Symeonidis, M. Ruxanda, A. Nanopoulos, and ing. In Proceedings of the ACM CHI Workshop on Interacting Y. Manolopoulos. Ternary Semantic Analysis of So- with Recommender Systems. ACM Press, May 1999. cial Tags for Personalized Music Recommendation. In [5] T. Bogers and A. Van den Bosch. Using Language Modeling Proceedings of ISMIR ’08, pp. 219–224, 2008. for Spam Detection in Social Reference Manager Websites. [22] M. Szomszor, C. Cattuto, H. Alani, K. O’Hara, A. Baldas- In R. Aly, C. Hauff, I. den Hamer, D. Hiemstra, T. Huibers, sarri, V. Loreto, and V. D. Servedio. Folksonomies, the Se- and F. de Jong, editors, Proceedings of the 9th Belgian-Dutch mantic Web, and Movie Recommendation. In Proceedings of Information Retrieval Workshop (DIR 2009), pp. 87–94, En- the ESWC Workshop on Bridging the Gap between Semantic schede, February 2009. Web and Web 2.0, 2007. [6] J. S. Breese, D. Heckerman, and C. Kadie. Empirical Anal- [23] K. H. L. Tso-Sutter, L. B. Marinho, and L. Schmidt-Thieme. ysis of Predictive Algorithms for Collaborative Filtering. In Tag-aware Recommender Systems by Fusion of Collabora- Proceedings of the Fourteenth Annual Conference on Uncer- tive Filtering Algorithms. In Proceedings of SAC ’08, pp. tainty in Artificial Intelligence, pp. 43–52, 1998. 1995–1999, New York, NY, 2008. ACM. [7] R. Burke. Hybrid Recommender Systems: Survey and Exper- [24] R. Wetzker, W. Umbrath, and A. Said. A Hybrid Approach iments. User Modeling and User-Adapted Interaction, 12(4): to Item Recommendation in Folksonomies. In Proceedings of 331–370, 2002. ESAIR ’09, pages 25–29, New York, NY, USA, 2009. ACM. [8] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining Content-Based and Collaborative