<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Probabilistic Ranking Approach for Tag Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhen Liao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maoqiang Xie</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Cao</string-name>
          <email>caohaog@mail.nankai.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yalou Huang</string-name>
          <email>huangylg@nankai.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information Technology Science, Nankai University</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Software, Nankai University</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social tagging is a typical Web 2.0 application that lets users share knowledge and organize massive web resources. Choosing appropriate words as tags can be time consuming for users, so a tag recommendation system is needed to accelerate this procedure. In this paper we formulate tag recommendation as a probabilistic ranking process; in particular, we propose a hybrid probabilistic approach that combines a language model and a statistical machine translation model. Experimental results validate the effectiveness of our method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Folksonomy is a way to categorize Web resources by harnessing the "wisdom" of
web users; nowadays it exists in many web applications such as Delicious3,
Flickr4, and Bibsonomy5. A user can create and share her knowledge while
tagging resources that interest her. Web resources come in many forms: a
resource could be a Web page, a published paper, or a book. Tagging a resource
with appropriate words is not easy and can cost a lot of time, so a tag
recommendation system is needed to ease this time-consuming step. Typically a
recommendation system suggests 5 or 10 tags to the user for a given resource.
The suggested tags help the user think of eligible words and realize which
aspects of the resource interest others. To address these problems, ECML PKDD
held the second tag recommendation discovery challenge6. This paper presents a
probabilistic ranking approach submitted to the challenge.</p>
      <p>Given a resource, users choose tags according to different aspects of the
resource and their specific interests. Picking a tag from the entire tag set and
assigning it to the resource can be formulated as the following process: given a
resource and a user, rank the tags by their relevance to the resource and user.
Here relevance denotes how likely the user would be to label the resource with
this tag.</p>
      <sec id="sec-1-1">
        <title>3 http://del.icio.us</title>
      </sec>
      <sec id="sec-1-2">
        <title>4 http://www.flickr.com/</title>
      </sec>
      <sec id="sec-1-3">
        <title>5 http://www.bibsonomy.org/</title>
      </sec>
      <sec id="sec-1-4">
        <title>6 http://www.kde.cs.uni-kassel.de/ws/dc09</title>
        <p>We suppose a tag recommendation system works best when the recommended
tags are sorted by relevance before being suggested to the user.</p>
        <p>In this paper, the dataset provided by Bibsonomy is a set of posts. Each post
denotes a triple {user, resource, set of tags}. A resource type can be bookmark
or bibtex, where a bookmark is a Web page and a bibtex is a publication. Both
bookmark and bibtex resources contain many fields: URL, description, etc. The
textual information in the fields can be merged into a pseudo document.</p>
        <p>A natural way of choosing tags is to select words from the pseudo document
of the given resource; a TF-like maximum likelihood method can reach this goal.
The important problem is that the maximum likelihood model cannot generate
tags that are meaningful but do not appear in the document. To incorporate
previously popular tags and tags preferred by a user, a tag recommendation model
can be formulated as a language model smoothed via the Jelinek-Mercer method, as
described in Section 3.2. However, the language modeling approach cannot
learn the word-tag relatedness that reflects how other users choose tags for the
words in the document. Since the textual information in a post can
be considered a parallel corpus - {words in document, tags} - we propose to use
the statistical machine translation approach to learn the translation probability
from words to tags.</p>
        <p>Finally, we propose a candidate set based tag recommendation algorithm
that generates candidate tags from the textual fields of a resource using the
maximum likelihood and statistical machine translation models. The effectiveness
of our approach is validated on the bookmark and bibtex tagging test datasets
provided by Bibsonomy. When the textual content of a bookmark resource is
inadequate, we utilize the tags used within the same domain to extend the
candidate set. We also found that simple co-occurrence based translation
probability estimation performs as well as IBM Model 1 [6], which uses the EM
algorithm to learn the translation probability. An advantage of the co-occurrence
based approach is its convenience in handling new training data, since training
the model amounts to counting the co-occurrences of words and tags. The EM-based
approach, in contrast, must re-train the translation model through iterations,
which can be time consuming for a large-scale dataset.</p>
        <p>The rest of this paper is organized as follows. In Section 2 related work
is surveyed. In Section 3 our content based tag recommendation models are
presented, and the recommendation algorithm is described in Section 4. In Section
5 we describe the data format and preprocessing steps, and experimental results
are reported in Section 6. Finally, in Section 7 we conclude this paper and point
out possible future research issues.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Most existing tag recommendation approaches are based on the textual
information of the resource and the previous interests of users. So far,
information retrieval, data mining, and natural language processing techniques
have been used to solve the tag recommendation problem.</p>
      <p>Heymann et al. [1] use one of the largest crawls of the social bookmarking
system Delicious and present studies of the factors that can impact the
performance of tag prediction. The predictability of tags is measured by
methods such as entropy based metrics, and tag-based association rules are
proposed to assist tag prediction. Learning the word-tag relatedness via
association rules requires tuning the confidence and support to find meaningful
rules, whereas we transfer it into a translation probability that reaches a
converged solution without tuning.</p>
      <p>Tatu et al. [2] use document and user models derived from the textual content
associated with URLs and publications by social bookmarking tool users.
Natural language processing techniques are used to extract concepts (part of
speech, etc.) from the textual information, and WordNet7 is used to stem the
concepts and link synonyms. The difference between our work and theirs is that
they expand concepts via WordNet, but do not have word-to-tag translation
probabilities such as from `eclipse' to `java'.</p>
      <p>Lipczak [3] focuses on folksonomies oriented towards individual users, and
proposes a three-step tag recommendation system that conducts Personomy based
filtering using the user's previously used tags after the extraction and retrieval
of tags. The recommendation approach in [3] is similar to ours, but the
scores of candidate tags are computed differently. They multiply the
different factors, whereas we use a weighted sum in which the weights can
be set to prefer different components. Besides, we use the statistical machine
translation approach to learn the word-tag relatedness, which differs from the
model proposed in [3].</p>
      <p>The language modeling approach [4] has been applied in information retrieval
with many smoothing strategies [5]. Statistical machine translation
approaches [6] have shown their theoretical soundness and effectiveness in
translation, and Berger et al. [7] and Xue et al. [8] incorporate statistical
translation approaches into the information retrieval and automatic question
answering fields. This theoretical soundness and effectiveness make it sensible
to adopt language modeling and statistical machine translation for tag
recommendation. The statistical machine translation approach also naturally
solves the problem of learning the word-tag relatedness, sharing common tagging
knowledge among users.</p>
    </sec>
    <sec id="sec-3">
      <title>Content Based Tag Recommendation Models</title>
      <sec id="sec-3-1">
        <title>Problem Definition</title>
        <p>In this paper, a tag set is denoted as t = {t1, ..., tQ}, where ti is a single
word or term and Q is the number of tags in t.</p>
        <p>The tag recommendation task is to suggest a tag set t for a user Uk,
given a bookmark/publication resource Rj, which might be a web page, a book, a
paper, etc. The resource Rj contains several fields such as URL, title, and
description, and we denote the resource content as a pseudo document Dj.</p>
        <sec id="sec-3-1-1">
          <title>7 http://wordnet.princeton.edu</title>
          <p>Suppose the recommendation system is required to suggest N tags; the task
is to find the N tags {t1, ..., tN} from the entire tag set with the largest
probability p(ti|Uk, Dj).</p>
          <p>To solve the task, a training set S = {S1, ..., SK} is given, where each Si
is a triple {ti, Ui, Di}. Here ti is a tag set, Ui ∈ U = {U1, ..., UM} is a user,
and Di ∈ D = {D1, ..., DN} is a resource. We then learn a tag recommendation
model M from S.</p>
          <p>At the testing stage, a testing set T = {T1, ..., TP} with Tj = {Uj, Dj} is
given. The model M is asked to suggest a tag set tj for each Tj. A ground-truth
tag set G = {g1, ..., gP} is then used to judge the recommendations
{t1, ..., tP}, and the performance is obtained via evaluation measures such as
Precision, Recall, and F-measure.</p>
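          <p>The per-post judgment just described can be sketched in a few lines; this is an illustrative implementation, not the official challenge scorer:</p>

```python
def evaluate(recommended, ground_truth):
    """Precision, recall and F1 for one post's recommended tag list."""
    rec, truth = set(recommended), set(ground_truth)
    hits = len(rec & truth)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```

          <p>For example, recommending {java, web} against the ground truth {java, programming} yields precision 0.5, recall 0.5, and F1 0.5.</p>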
          <p>A specific user Uk has her own preference in choosing a word ti as a tag,
and if we have this user's information in the training set S, we can formulate
this preference as P(ti|Uk) = c(ti, Uk) / |Uk|, where c(ti, Uk) is the frequency
with which ti is used by user Uk, and |Uk| is the total frequency of all tags
used by Uk.</p>
          <p>We define the tag generating probability of a tag ti for a given user and
document tuple {Uk, Dj} as:</p>
          <p>P(ti|Dj, Uk) = (1 - λ) P(ti|Dj) + λ P(ti|Uk)    (1)</p>
          <p>where λ is a trade-off parameter between the resource content and the
user.</p>
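          <p>A minimal sketch of this mixture, assuming the user model P(ti|Uk) is estimated by relative frequency as above; the function names and the trade-off parameter name `lam` are ours, for illustration only:</p>

```python
from collections import Counter

def user_model(tag_history):
    """P(t|Uk) = c(t, Uk) / |Uk|: relative frequency of each tag the user used."""
    counts = Counter(tag_history)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def tag_probability(tag, p_doc, p_user, lam=0.5):
    """Eq. (1): mix the document model and the user model with trade-off lam."""
    return (1 - lam) * p_doc.get(tag, 0.0) + lam * p_user.get(tag, 0.0)
```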
          <p>In the following we introduce the language model and statistical machine
translation approaches for estimating P(ti|Dj), and then combine them into
our final model.</p>
          <p>A natural and simple way to estimate P(ti|Dj) is the maximum likelihood
approach:</p>
          <p>Pml(ti|Dj) = c(ti, Dj) / |Dj|    (2)</p>
          <p>where c(ti, Dj) is the number of occurrences of ti in Dj, and |Dj| is the
document length of Dj. The shortcoming of the maximum likelihood estimate is
that it cannot generate a tag that does not appear in Dj, so we introduce a
language model smoothed via the Jelinek-Mercer method [5]:</p>
          <p>Plm(ti|Dj) = (1 - β) Pml(ti|Dj) + β Pml(ti|C)    (3)</p>
          <p>where β is the smoothing parameter and C is the entire corpus. The
smoothing term P(ti|C) can be interpreted as the probability that the word ti
is used as a tag; we define P(ti|C) as c(ti)/#tags, where #tags is the total
number of tags in the training set S. The language modeling approach (3) can
thus be seen as incorporating both the words in the document and the previously
popular tags of all users.</p>
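          <p>The smoothed language model of Equation (3) can be sketched as follows; the smoothing-weight name `beta` and the function names are illustrative choices of ours:</p>

```python
from collections import Counter

def smoothed_lm(document, corpus_tag_counts, beta=0.5):
    """Eq. (3): Jelinek-Mercer smoothing of the document maximum likelihood
    model with the corpus-wide tag probability P(t|C)."""
    doc_counts = Counter(document)
    doc_len = len(document)
    n_tags = sum(corpus_tag_counts.values())

    def p(tag):
        p_ml = doc_counts[tag] / doc_len if doc_len else 0.0
        p_c = corpus_tag_counts.get(tag, 0) / n_tags
        return (1 - beta) * p_ml + beta * p_c

    return p
```

          <p>Note that a tag absent from the document still gets a nonzero score from the corpus term, which is exactly what the smoothing is for.</p>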
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Statistical Machine Translation Approach</title>
        <p>The language modeling approach, however, does not consider the word-tag
relatedness, which is important for tag recommendation. To address this, we
further introduce the Statistical Machine Translation (SMT) approach [6] [7] [8]
for estimating the probability P(ti|Dj):</p>
        <p>Psmt(ti|Dj) = (|Dj| / (|Dj| + 1)) Ptr(ti|Dj) + (1 / (|Dj| + 1)) P(ti|null)    (4)</p>
        <p>where P(ti|null) can be regarded as the background smoothing model
P(ti|C); a more detailed comparison of the two can be found in [8]. Ptr(ti|Dj)
is the translation probability from Dj to ti:</p>
        <p>Ptr(ti|Dj) = Σ_{w ∈ Dj} Ptr(ti|w) Pml(w|Dj)    (5)</p>
        <p>To learn the word-to-tag translation probability Ptr(ti|w), the EM
algorithm can be used; the details of the EM algorithm for learning the word-tag
relatedness P(ti|w) in the Statistical Machine Translation (SMT) model are
described in [6]. In the training set S = {S1, ..., SK}, the parallel corpus of
tags and documents Sj = {tj, Dj} is utilized, and the EM steps for learning
P(ti|w) can be formulated as:</p>
        <p>E-Step:</p>
        <p>P1tr(ti|w) = (1/w1) Σ_{j=1..K} c(ti, w; tj, Dj)    (6)</p>
        <p>M-Step:</p>
        <p>c(ti, w; tj, Dj) = [P(ti|w) / (P(ti|w1) + ... + P(ti|wo))] · #(ti, tj) · #(w, Dj)    (7)</p>
        <p>In Equation (6), w1 = Σ_ti Σ_{j=1..K} c(ti, w; tj, Dj) is the
normalization factor. In Equation (7), {w1, ..., wo} are the words contained in
Dj, and #(ti, tj) and #(w, Dj) are the number of occurrences of ti in tj and of
w in Dj, respectively. The convergence of this EM algorithm is proved in [6].</p>
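        <p>A compact sketch of this EM procedure, in the spirit of IBM Model 1 [6]; this is a simplified illustration with names of our choosing, not the code used in the experiments:</p>

```python
from collections import defaultdict

def train_ibm1(parallel_corpus, iterations=10):
    """EM training of word-to-tag translation probabilities P(t|w) on a
    parallel corpus of (tags, document words) pairs."""
    # uniform initialisation over all tags seen in the corpus
    tags = {t for ts, _ in parallel_corpus for t in ts}
    p = defaultdict(lambda: 1.0 / len(tags))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(t, w)
        total = defaultdict(float)   # per-word normaliser
        for ts, words in parallel_corpus:
            for t in ts:
                z = sum(p[(t, w)] for w in words)  # P(t|w1) + ... + P(t|wo)
                for w in words:
                    c = p[(t, w)] / z
                    count[(t, w)] += c
                    total[w] += c
        # normalise expected counts into new probabilities
        p = defaultdict(float, {(t, w): count[(t, w)] / total[w]
                                for (t, w) in count})
    return p
```

        <p>On a toy corpus where `java' always co-occurs with `eclipse' but `code' also appears with other tags, the iterations concentrate P(java|eclipse) toward 1, mirroring the convergence behaviour noted above.</p>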
        <p>In this paper, we also find that a co-occurrence based translation
probability is helpful for tag recommendation; we denote it as:</p>
        <p>P2tr(ti|w) = [Σ_{j=1..K} #(ti, tj) · #(w, Dj)] / [Σ_{j=1..K} #(w, tj, Dj)]    (8)</p>
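        <p>The co-occurrence estimate can be computed by simple counting. The sketch below normalises per word so that the probabilities over tags sum to one, which is one plausible reading of the normaliser in Equation (8); names are illustrative:</p>

```python
from collections import defaultdict

def cooccurrence_translation(parallel_corpus):
    """Co-occurrence estimate of P(t|w): count how often tag t and word w
    appear together in the same post, then normalise per word w."""
    num = defaultdict(float)
    den = defaultdict(float)
    for tags, words in parallel_corpus:
        for w in set(words):
            n_w = words.count(w)          # #(w, Dj)
            for t in set(tags):
                num[(t, w)] += tags.count(t) * n_w   # #(t, tj) * #(w, Dj)
                den[w] += tags.count(t) * n_w
    return {tw: c / den[tw[1]] for tw, c in num.items()}
```

        <p>Training is a single counting pass, which is why this model handles new training data so conveniently compared with the iterative EM training.</p>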
        <p>where #(ti, tj) denotes the number of times tag ti appears in tj, and
similarly for #(w, Dj). This model can be regarded as a simple approximation of
the EM based translation model, and it is also effective. Note that the EM based
translation probability is denoted P1tr(ti|w) and the co-occurrence based
translation probability P2tr(ti|w) hereafter.</p>
        <p>Now we combine the above methods to obtain our final model:</p>
        <p>Pfinal(ti|Dj, Uk) = α P(ti|C) + β P(ti|Uk) + γ Pml(ti|Dj) + δ Σ_w Ptr(ti|w) Pml(w|Dj)    (9)</p>
        <p>where α + β + γ + δ = 1 and Ptr can be P1tr or P2tr. Tuning these four
parameters is not easy, so we split both the Cleaned Dump and the Post Core
datasets into a training set and a validation set, train the model on the
training set, and set the parameters empirically several times, choosing the
setting with better performance on the validation set. We omit the details due
to space restrictions; in the experiments we found the performance is relatively
good with α = 0.15, β = 0.1, γ = 0.05, δ = 0.7. We use these parameters with the
Cleaned Dump dataset as our final training set for the challenge.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Candidate Set based Tag Recommendation Algorithm</title>
      <p>Since the task of tag recommendation is to suggest tags for a given
document and user, it differs from Information Retrieval [7] or Question
Answering [8], where a query/question is given and the relevant
documents/answers must be found.</p>
      <p>Given a document Dj and user Uk, we first build a recommendation tag
candidate set CS from the words in Dj, and we also add the top L related words
by Ptr(t|w) for every word w in Dj. Then we compute P(ti|Dj, Uk) for each
tag ti ∈ CS. Finally we sort the tags in descending order of P(ti|Dj, Uk) and
return the top N tags, as required by the application system. L is set to
20 and N to 5 in the experiments. The algorithm is summarized in Table 1.</p>
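      <p>The candidate-set algorithm of Table 1 can be sketched as follows; here `score` stands for any estimate of P(ti|Dj, Uk), `translations` maps each word to its tag translation probabilities, and all names are illustrative:</p>

```python
def recommend(doc_words, score, translations, L=20, N=5):
    """Candidate-set recommendation sketch: gather candidate tags from the
    document words plus the top-L translations of each word, score every
    candidate, and return the N best."""
    candidates = set(doc_words)
    for w in doc_words:
        # top-L tags t ranked by translation probability Ptr(t|w)
        top = sorted(translations.get(w, {}).items(),
                     key=lambda kv: kv[1], reverse=True)[:L]
        candidates.update(t for t, _ in top)
    # rank candidates by the combined score P(t|Dj, Uk), descending
    return sorted(candidates, key=score, reverse=True)[:N]
```

      <p>The translation step is what lets the system suggest a tag like `java' for a document that only contains the word `eclipse'.</p>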
    </sec>
    <sec id="sec-5">
      <title>Data Preparing and Preprocessing</title>
      <p>The dataset we used was downloaded from the ECML PKDD Discovery Challenge
20098 site and is provided by BibSonomy9. There are two datasets: Cleaned Dump
and Post Core. The Cleaned Dump contains all public bookmark and publication
posts of BibSonomy until (but not including) 2009-01-01. The Post Core is a
subset of the Cleaned Dump that removes all users, tags, and resources
appearing in only one post. Brief statistics of Cleaned Dump and Post Core can
be found in Table 2. One tag assignment means one user chose one tag for one
resource, so one post can have several tag assignments. The numbers of posts are
shown for bookmark, bibtex, and the entire set: the bookmark and bibtex counts
are separated by `/', and the entire set is given after `:'.</p>
      <sec id="sec-5-1">
        <title>8 http://www.kde.cs.uni-kassel.de/ws/dc09</title>
      </sec>
      <sec id="sec-5-2">
        <title>9 http://www.bibsonomy.org/</title>
        <p>There are three tables in the dataset: tas, bookmark, and bibtex. The
fields of these tables are listed in Table 3. For a bookmark resource the field
`content type' is 1, and for a bibtex resource it is 2. The fields in bold are
used to generate the pseudo document Dj and the tags tj in the training
process.</p>
        <p>We first remove the stop words in the bookmark and bibtex tables, since
they are seldom used as tags and are usually meaningless. The stop word list was
downloaded from Lextek10 (http://www.lextek.com/manuals/onix/stopwords1.html).
Note that we do not remove stop words in the tas file. The top 5 stop words
occurring in Post Core and their frequencies can be found in Table 4. There are
in total 19,647 and 2,513 stop word tag assignments in Cleaned Dump and Post
Core, corresponding to 1.39% and 0.99% respectively. In contrast, the total
frequencies of stop words in the pseudo documents of Cleaned Dump and Post Core
are over 588,907 and 61,113, which suggests that stop words should not be
considered as tags in most cases.</p>
        <p>Table 4. Top stop words and their frequencies in tags.
Cleaned Dump: all:3105 of:1414 and:1227 best:1124 three:1081 c:806
Post Core: all:655 open:211 c:165 best:152 work:77</p>
        <p>In Table 5 we list the top 10 tags in Cleaned Dump and Post Core. We will
see later that the co-occurrence based translation model is likely to generate
words that appear more often.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experimental Result</title>
      <sec id="sec-6-1">
        <title>Tagging Performance</title>
        <p>The evaluation measures in the following experiments are the widely used
Precision, Recall, and F1-measure. The testing datasets were released by the
ECML PKDD challenge in two tasks: task 1 is content based tag recommendation,
and task 2 is graph based tag recommendation11. In task 1 the user and resource
of a post might not have been seen before, so the content information of the
resource is critical for tag recommendation. In task 2 the user, resource, and
tags of each post in the test data are all contained in the Post Core dataset,
so it is intended for methods relying only on the graph structure of the
training data.</p>
        <p>We use the whole Cleaned Dump dataset as the training set to train the
model and test the performance of our model on both tasks. For the parameters,
we set α = 0.15, γ = 0.05, β = 0.1, δ = 0.7 as mentioned in Section 3.4. The
results are shown in Figure 1. Here final_em denotes the final model with P1tr
(EM based), and final_co denotes the final model with P2tr (co-occurrence
based). The x-axis is the top position and the y-axis is the F-measure.
11 http://www.kde.cs.uni-kassel.de/ws/dc09</p>
        <p>The results indicate that although P2tr (co-occurrence) is simpler, it
is comparable to P1tr. In our previous experiments we also found that the
textual information from a bookmark resource is sometimes not adequate to
generate some tags in the post and needs to be expanded. Instead of using an
extrinsic resource such as WordNet, we aggregate the tags in the same web site
domain for a bookmark resource and use them to expand the recommendations.
The reason we do not expand the terms of a bibtex resource is that bibtex
resources are publications, and their web sites provide less information about
tags. Trying other tag expansion methods is left as future work. We formulate
this expansion as P(ti|Site), and the recommendation model for bookmarks
becomes:</p>
        <p>Pfinal_ex(ti|Dj, Uk) = α P(ti|C) + β P(ti|Uk) + γ Pml(ti|Dj) + δ Σ_w Ptr(ti|w) Pml(w|Dj) + η P(ti|Site)    (10)</p>
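        <p>The site model P(ti|Site) can be estimated by counting tags per URL domain. A sketch, under the assumption that each bookmark post carries its URL (names are ours):</p>

```python
from collections import Counter
from urllib.parse import urlparse

def site_tag_model(posts):
    """P(t|Site): relative frequency of tags previously used on bookmarks
    from the same web-site domain. posts is a list of (url, tags) pairs."""
    by_domain = {}
    for url, tags in posts:
        domain = urlparse(url).netloc
        by_domain.setdefault(domain, Counter()).update(tags)
    # normalise each domain's counts into a probability distribution
    return {d: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for d, cnt in by_domain.items()}
```

        <p>For instance, two posts under www.apple.com tagged {apple, mac} and {apple} give P(apple|Site) = 2/3, in the spirit of the domain samples in Table 6.</p>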
        <p>To illustrate the expansions for different domains, we sample some domains
and their top used tags with probabilities in Table 6.
Domain: tags and their previously used probability
www.apple.com: apple:0.17 mac:0.13 software:0.09 osx:0.07 bookmarks:0.07
answers.yahoo.com: knowledge:0.14 yahoo:0.14 web20:0.07 all:0.07 answer:0.07
ant.apache.org: java:0.19 ant:0.17 programming:0.07 apache:0.07 tool:0.07
picasa.google.com: google:0.21 image:0.14 download:0.14 linux:0.14 picasa:0.14
research.microsoft.com: microsoft:0.10 research:0.09 people:0.04 social:0.04 award:0.03
www.research.ibm.com: ibm:0.11 datamining:0.07 software:0.04 machinelearning:0.04 journal:0.04</p>
        <p>After the tag expansion via the URL domain, the candidate set CS for
the recommendation contains the top used tags in the same domain as Dj. The
performance of (10) with the expansions on the testing set is shown in Tables
7 and 8. The performance is shown for bookmark only, bibtex only, and the
entire set: the bookmark and bibtex results are separated by `/', and the entire
set is given after `:'. We chose the co-occurrence based model P2tr for the
competition, though the performance in terms of F-measure at 5 is also good with
the EM based model P1tr. The F-measures of the EM based model with the same
parameters as in Table 7 for task 1 and task 2 are shown in Table 9. We find
that P2tr and P1tr are comparable once again: on F-measure at 1 the
co-occurrence based model is better, but on F-measure at 5 the EM based model
is better.</p>
        <p>Next we conduct experiments on each component of our final model (9):
the document maximum likelihood method, the language model (`LM + User
Model'), the EM based translation model P1tr(ti|w), and the co-occurrence based
translation model P2tr(ti|w) are chosen. In the `LM + User Model' we set the
parameters α = 0.5, γ = 0.3, β = 0.2, δ = 0. This can be considered a language
model that incorporates the maximum likelihood, the previous tag probability in
the whole corpus, and the user's preference model. The performance on both
testing datasets of task 1 and task 2 is illustrated in Figure 2. The x-axis is
the top position from top 1 to top 5 and the y-axis is the F-measure value. We
list only the F1 measure because it reflects both precision and recall.</p>
        <p>From the experimental results we can see that the translation based models
are better than the maximum likelihood method and `LM + User Model' in task 2.
The co-occurrence based model is worst in task 1, and the EM based model is
better than the co-occurrence based model on both tasks. We analyzed the results
of the co-occurrence based model on task 1 and found many recommendations are
commonly used tags, because the co-occurrence based model prefers to generate
tags that occurred more often before. This suggests that if the resource/user
has been seen before, the co-occurrence based model performs well; if not, it is
better to choose the EM based model. The `LM + User Model' performs best on
task 1, but its performance is still lower than that in Table 7, and it performs
worse than the translation models on task 2.</p>
        <p>To compare the EM based and co-occurrence based models, we pick out
several words w with their top translated words ti in both P1tr(ti|w) (EM based)
and P2tr(ti|w) (co-occurrence based). The sampled words can be found in
Table 10. We find that in the EM based translation model, a word is most likely
to translate into itself. This indicates that the EM based translation model can
be considered a combination of the maximum likelihood model, which only
generates the word itself, and the co-occurrence based translation model, which
has a higher probability of generating other words as tags. The co-occurrence
model is likely to generate popular tags in the corpus, such as `tools',
`software', and `social'.</p>
        <p>In this paper we proposed a probabilistic ranking approach for tag
recommendation. The textual information from the resources and the parallel
textual corpus from previous posts are used to learn the language and
statistical translation models. Our hybrid probabilistic approach incorporates
both the content based textual model and the graph structure existing in posts
for sharing common tagging knowledge among users.</p>
        <p>As future work, we intend to study how to choose the parameters via
machine learning approaches so as to avoid heuristic settings. Furthermore,
enriching the extra information of the resources, for example using the
citations (references) of a publication to augment the information of a
publication resource, using other tag expansion techniques, and conducting
natural language understanding of tag concepts, as well as studying evaluation
measures for tag recommendation, are all possible future research
directions.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This work is supported by the National Natural Science Foundation of China
under grant 60673009 and China National Hanban under grant 2007-433. The
authors thank Chin-Yew Lin of Microsoft Research Asia for his valuable comments
on this paper. Thanks also to Jie Liu, Yang Wang, and Min Lu for their helpful
discussions and suggestions.</p>
      <p>3. Lipczak, M. Tag Recommendation for Folksonomies Oriented towards
Individual Users. In Proceedings of the ECML PKDD Discovery Challenge
(RSDC 2008), pages 84-95.</p>
      <p>4. Ponte, J. M. and Croft, W. B. A Language Modeling Approach to
Information Retrieval. In Proceedings of the 21st Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval
(SIGIR 1998), pages 275-281.</p>
      <p>5. Zhai, C. and Lafferty, J. A Study of Smoothing Methods for Language
Models Applied to Information Retrieval. ACM Transactions on Information
Systems, 2004, pages 179-214.</p>
      <p>6. Brown, P. F., Della Pietra, V. J., Della Pietra, S. A. and Mercer, R. L.
The Mathematics of Statistical Machine Translation: Parameter Estimation.
Computational Linguistics, 1993, pages 263-311.</p>
      <p>7. Berger, A. and Lafferty, J. Information Retrieval as Statistical
Translation. In Proceedings of the 22nd Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 1999),
pages 222-229.</p>
      <p>8. Xue, X., Jeon, J. and Croft, W. B. Retrieval Models for Question and
Answer Archives. In Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2008),
pages 475-482.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Heymann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ramage</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>Social Tag Prediction</article-title>
          .
          <source>In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR</source>
          <year>2008</year>
          ), pages
          <fpage>531</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tatu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikanth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>D'Silva</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>RSDC'08: Tag Recommendations using Bookmark Content</article-title>
          .
          <source>In Proceedings of ECML PKDD Discovery Challenge 2008 (RSDC</source>
          <year>2008</year>
          ), pages
          <fpage>96</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>