<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Tag Recommendation System based on contents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ning Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jie Tang</string-name>
          <email>jietang@tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Engineering Group Department of Computer Science and Technology, Tsinghua University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social bookmarking tools have become increasingly popular, and tagging is used to organize information and to allow users to recall or search resources. Users need to type tags whenever they post a resource, so a good tag recommendation system can ease the process of finding useful and relevant keywords. Researchers have done much relevant work on recommendation systems, but traditional collaborative systems do not fit our tag recommendation task. In this paper, we present two different methods: a simple language model and an adaptation of a topic model. We evaluate and compare these two approaches and show that a combination of the two performs better on Task 1 of the ECML PKDD Discovery Challenge 2009.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the advent of social resource sharing tools such as BibSonomy1, Flickr2 and del.icio.us3, tagging has become a popular way to organize information and to help users find other users with similar interests and useful information within a given category. Tags posted by a user are relevant not only to the content of the bookmark but also to the particular user. According to [3], the collection of a user's tag assignments is his/her personomy, and a folksonomy consists of collections of personomies. From the available training data, we can see that some tags are simply words extracted from the title, some tags are the concept or main topic of the resource, and others are very specific to a user (see Table 1). The last three lines of the table show that user 293 posts tags like swss0603, swss0609 and swss0602, which are very specific to that user. Since the test data for Task 1 contains posts whose user, resource or tags are not in the training data, traditional collaborative recommendation systems might not perform well, because most of them cannot recommend tags that are not in the tag set of the training data. This paper presents our tag recommendation system,</p>
      <sec id="sec-1-1">
        <title>1 http://www.bibsonomy.org 2 http://www.flickr.com 3 http://del.icio.us</title>
        <p>which is a combination of two methods: a simple language model and an adaptation of the topic model according to [7].</p>
      </sec>
      <sec id="sec-1-7">
        <title>Table 1. Sample posts with their tags and the source of the tags</title>
        <p>USER | TITLE | TAGS | SOURCE OF TAGS
— | SVG: Adobe | adobe, svg | title
— | SourceForge.net: deliciousjava | api java, delicious | title
— | Reassessing Working Memory: Comment on Just and Carpenter (1992) and Waters and Caplan (1996) | psycholinguistics, review, workingmemory | concept or topic
293 | A Semantic Web Primer | swss0603 | specific to the user
293 | The ABCDE Format Enabling Semantic Conference Proceedings | swss0609 | specific to the user
293 | Learning of Ontologies for the Web: the Analysis of Existent Approaches | swss0602 | specific to the user</p>
        <p>The first method extracts some keywords from the description and other information of the post to constitute a candidate set. We then use the relevance of each word to the document to score the words in the candidate set and recommend the words with the highest scores. The second method uses an ACT model [7], which can capture some conceptual or topical knowledge of the post. Given a test post, the model scores all the tags that have been posted previously and recommends the tags with the highest scores.</p>
        <p>These two methods focus on two different aspects. The first method will typically recommend keywords extracted from the title, while the second method uses a probabilistic latent semantic method to recommend tags that are similar to the post in terms of conceptual knowledge. Comparing the two methods, we find that the recommended tags are usually different. Consequently, combining them is an intuitively appealing choice.</p>
        <p>This paper is organized as follows: Section 2 reviews recent developments in the area of social bookmark tag recommendation systems. Section 3 describes our proposed system and the combination method in detail. In Section 4, we present and evaluate our experimental results on the test data of the ECML PKDD challenge, and we conclude in Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>The recent rise of Web 2.0 technologies has aroused the interest of many researchers in tag recommendation systems. Some approaches are based on collaborative information. For example, AutoTag [9] and TagAssist [8] use information retrieval techniques to recommend tags for weblog posts. They recommend tags based on the tags posted to similar weblogs, and they cannot recommend new tags that are not in the training file. FolkRank [4,5], an adaptation of the famous PageRank algorithm, is a graph-based recommendation system; it also cannot recommend new tags that are not in the training file. The experimental results in [5] reveal that FolkRank can outperform other collaborative methods, but to some extent FolkRank relies on a dense core in the training file, so it might not fit our task.</p>
      <p>All the methods mentioned above are based on collaborative information and on similarity between users and resources. However, when there are many new users and resources in the test data (as in our Task 1), those methods cannot perform well. In the RSDC '08 challenge, the participants [1,2] who used methods based on words extracted from the title, semantic knowledge and the user's personomy outperformed the other methods. Consequently, we propose a tag recommendation system based mainly on the contents.</p>
    </sec>
    <sec id="sec-3">
      <title>Our Tag Recommendation System</title>
      <sec id="sec-3-1">
        <title>Notations</title>
        <p>First, we define the notations used in this paper. We group the data in bookmark by its url_hash and the data in bibtex by its simhash1. If some posts in the bookmark or bibtex file share the same url_hash or simhash1, they are mapped to one resource r. From bookmark we extract the description and extended description, while from bibtex we extract journal, booktitle, description, bibtexAbstract, title and author. We define this information as the description of resource r. For each resource r and each user ud who has posted tags to r, the description contains a vector wd of Nd words, and td is a vector of the Td tags posted to this resource by the user ud. The training dataset can then be represented as</p>
        <p>D  {(w1, t1, u1),..., (wD , tD , uD )}
Language model is widely used in natural language processing applications such as
speech recognition, machine translation and information retrieval. In our model, first
we pick some words to form a candidate set of recommended tags and then score all
the words in the candidate set and recommend words with highest scores for our tag
where Nd is the number of word tokens in the description d of r*, tf(t1,d) is the word
frequency(i.e., occurring number) of word t1 in the description of d of r∗, ND′ is the
number of word tokens in the entire test dataset, and tf(t1,D’) is the word frequency of
word t1 in the collection D’. λ is the Dirichlet smoothing factor and is commonly set
according to the average document length, i.e. ND′ /|D′| in our cases.</p>
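        <p>As an illustration, the Dirichlet-smoothed word scoring can be sketched in Python as follows. This is a minimal sketch under our own naming; the function names and the toy document data are illustrative, not the actual challenge pipeline:</p>
```python
from collections import Counter

def score_word(word, doc_tokens, corpus_tokens, lam):
    """Dirichlet-smoothed relevance of `word` to a document:
    mixes its in-document frequency with its collection frequency."""
    nd = len(doc_tokens)                 # Nd: tokens in the description
    tf_d = Counter(doc_tokens)[word]     # tf(t1, d)
    n_all = len(corpus_tokens)           # ND': tokens in the whole collection
    tf_all = Counter(corpus_tokens)[word]  # tf(t1, D')
    mix = nd / (nd + lam)                # weight of the document model
    return mix * tf_d / nd + (1 - mix) * tf_all / n_all

# Toy data: one 4-token description inside an 8-token collection
doc = ["semantic", "web", "primer", "semantic"]
corpus = doc + ["web", "ontology", "learning", "web"]
lam = len(corpus) / 2   # average document length (two documents here)
```
Words that occur often in the description but rarely in the collection get the highest scores, which is what pushes title words to the top of the candidate set.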
        <p>As for C2, in order to get more information about a new resource, we take the similarity between resources into consideration and add tags previously posted to similar resources into C2. The similarity between resources is determined by the urls of the resources. Each url can be split into several sections; for example, 'http://www.kde.cs.uni-kassel.de/ws/dc09' will be split into three sections: 'www.kde.cs.uni-kassel.de', 'ws' and 'dc09'. The similarity between r1 and r2 is defined as sim(r1, r2) = 2^(number of identical sections of url1 and url2 − maximum number of sections of url1 and url2). For each resource r, we choose the three urls most similar to the url of r; their corresponding resources form the neighborhood of r, denoted neigh(r). C2 is composed of the tags previously posted to neigh(r*). For each t2 ∈ C2, we have the following generative probability:
P2(t2 | r*) = Σ_{r ∈ neigh(r*)} [ (Tr / (Tr + λ)) · n(t2, r)/Tr + (1 − Tr / (Tr + λ)) · n(t2, R)/|TR| ] · sim(r, r*)   (2)</p>
        <p>where n(t2, r) is the number of times that tag t2 has been posted to the resource r, n(t2, R) is the number of times that tag t2 has been posted in the training data, Tr is the number of tags posted to the resource r, and |TR| is the total number of tag assignments in the training data. λ is the Dirichlet smoothing factor and is commonly set according to the average number of tags per resource, i.e. |TR|/|R|.</p>
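        <p>The url-based similarity can be sketched as below. We read "number of identical sections" as the length of the common prefix of sections, which matches the worked example; the helper names are ours:</p>
```python
def url_sections(url):
    """Strip the scheme and split the remainder on '/',
    e.g. 'http://a.b/ws/dc09' -> ['a.b', 'ws', 'dc09']."""
    for prefix in ("http://", "https://", "ftp://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    return [s for s in url.split("/") if s]

def sim(url1, url2):
    """sim(r1, r2) = 2^(identical leading sections - max #sections)."""
    s1, s2 = url_sections(url1), url_sections(url2)
    shared = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        shared += 1
    return 2 ** (shared - max(len(s1), len(s2)))
```
Identical urls get similarity 1, and each missing or differing section halves the score, so neighbors are dominated by resources on the same host and path.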
        <p>Now we have the definition of the candidate set C = C1 ∪ C2, and for an active resource r* we have P1(t1|r*) for t1 ∈ C1 and P2(t2|r*) for t2 ∈ C2. The set of recommended tags is then T(u, r*) ≔ argmax^n_{t∈C} (P1(t|r*) + P2(t|r*)), where n is the number of recommended tags.</p>
      </sec>
      <sec id="sec-3-2">
        <title>ACT model</title>
        <p>The model we use is the Author-Conference-Topic (ACT) model [7], an adaptation of the topic model. The original model was used to simultaneously model papers, authors, and publication venues within a unified probabilistic topic model, using the topic distribution to represent the inter-dependencies among authors, papers and publication venues. In our task, for each (wd, td, ud) ∈ D, we map the description wd to the words of a paper, the tags td to the authors of the paper, and the user ud to the publication venue. Consequently, after modeling, we obtain probabilistic relations among the description, tags and user of a post. In detail, we obtain the distributions P(z|t), P(w|z) and P(u|z), where z is a hidden topic layer in the topic model. From this model, we can use the hidden topic layer to learn some conceptual knowledge of the post; the number of topics can be set manually. After obtaining P(z|t), P(w|z) and P(u|z) from the model, we can derive that for each word w ∈ wd and each tag t ∈ td:</p>
        <p>P(w | t) = Σ_z P(w | z) P(z | t)</p>
        <p>P(u | t) = Σ_z P(u | z) P(z | t)</p>
        <p>To recommend tags for a given user u′ and a given resource r′, we score all the tags t ∈ T as follows:
P(t | u′, r′) ∝ P(t, u′, r′) = P(t) · P(u′ | t) · P(r′ | t) ≈ P(t) · P(u′ | t) · Π_{w ∈ r′} P(w | t)   (3)</p>
        <p>There are two things that need to be mentioned here. First, the given user may be new, i.e. u′ ∉ U. In that case we cannot obtain P(u′|t) in equation (3), so we set its value to 1 and equation (3) becomes:</p>
        <p>P(t) · Π_{w ∈ r′} P(w | t) = P(t) · P(r′ | t) ∝ P(t | r′)   (4)
Because the user is new, we have no history of his/her tag postings in the training data, so equation (4) is a reasonable fallback. Secondly, not all the words in the given resource appear in our training data, so when calculating equation (3) we ignore any word that has not appeared in the training data.</p>
        <p>The set of recommended tags for a given user u′ and a given resource r′ is T(u′, r′) ≔ argmax^n_{t∈T} P(t|u′, r′), where n is the number of recommended tags and T is the collection of tags posted in the training file.</p>
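        <p>To make the scoring rule concrete, the sketch below scores tags from already-estimated distributions. The dictionary layout, the toy two-topic distributions and all names are our own assumptions, not the actual training output of the ACT model:</p>
```python
def p_w_given_t(w, t, p_w_z, p_z_t):
    # P(w|t) = sum_z P(w|z) P(z|t), marginalizing the hidden topic z
    return sum(p_w_z[z].get(w, 0.0) * p_z_t[t][z] for z in p_z_t[t])

def score_tag(t, user, words, p_t, p_u_z, p_z_t, p_w_z, known_users, vocab):
    """Score a tag for a (user, resource) pair; for a new user u' not in U,
    P(u'|t) is fixed to 1, reducing the score to P(t) * P(r'|t)."""
    s = p_t[t]
    if user in known_users:  # P(u'|t) = sum_z P(u'|z) P(z|t)
        s *= sum(p_u_z[z].get(user, 0.0) * p_z_t[t][z] for z in p_z_t[t])
    for w in words:
        if w not in vocab:   # ignore words unseen in the training data
            continue
        s *= p_w_given_t(w, t, p_w_z, p_z_t)
    return s

# Toy distributions over two topics and two tags
p_z_t = {"svg": {0: 1.0, 1: 0.0}, "java": {0: 0.0, 1: 1.0}}
p_w_z = {0: {"adobe": 0.7, "svg": 0.3}, 1: {"code": 0.6, "api": 0.4}}
p_t = {"svg": 0.5, "java": 0.5}
vocab = {"adobe", "svg", "code", "api"}
```
A tag whose topic distribution matches the resource's words scores highest; a tag from an unrelated topic is driven toward zero by the word product.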
      </sec>
      <sec id="sec-3-3">
        <title>Combination</title>
        <p>We have proposed two different methods to recommend tags: model one focuses on useful words extracted from the description or title of the resource, while the second model focuses on conceptual knowledge and the probabilistic relations among tags, resources and users. We are interested in the following question: can we combine these two models to achieve a better result for tag recommendation?
Input: a given resource r, a given user u, and the results of the ACT model, P(t), P(w|t) and P(u|t), for all tags, users and words in the training file.</p>
        <p>Output: T(u, r), the set of recommended tags
begin
  // Model one
  W ← words in the candidate set C
  foreach w ∈ W do
    score1[w] ← P1(w | r) + P2(w | r)
  end
  max_score1 ← max_{w ∈ W} score1[w]
  // Model two
  foreach t ∈ T do
    score2[t] ← P(t) · P(u | t) · Π_{w ∈ r} P(w | t)
  end
  max_score2 ← max_{t ∈ T} score2[t]
  // Combination
  foreach t ∈ W ∪ T do
    score[t] ← score1[t] + score2[t] · max_score1 / max_score2
  end
  T(u, r) ≔ argmax^n_t score[t]
end</p>
        <sec id="sec-3-3-1">
          <title>Algorithm 1: The combined tag recommendation system</title>
          <p>We have tried several approaches to combining these two models. A simple method is to add the scores of the two models and recommend the tags with the highest combined scores (Algorithm 1). Two things are worth noting here: 1) model one only scores the tags in the candidate set, whereas model two scores all tags t ∈ T; 2) due to the different distributions of the scores, we need to normalize the two scores before combining them. To address the first problem, we consider all tags t ∈ C ∪ T, where C is the candidate set used in model one. For normalization, we rescale so that ||score1||∞ = ||score2||∞ and then add the two scores; if a tag t is in the candidate set C but not in T, then score2[t] = 0, and if t is in T but not in C, then score1[t] = 0.</p>
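          <p>The normalization and combination step can be sketched as follows, assuming the two score dictionaries have already been computed; the function name and the toy scores are ours:</p>
```python
def combine_scores(score1, score2, n):
    """Rescale model two's scores so both maxima match
    (max-norm normalization), then add and take the top-n tags."""
    max1 = max(score1.values(), default=1.0)
    max2 = max(score2.values(), default=1.0)
    combined = {}
    for t in set(score1) | set(score2):
        # missing entries default to 0, as in Algorithm 1
        combined[t] = score1.get(t, 0.0) + score2.get(t, 0.0) * max1 / max2
    return sorted(combined, key=combined.get, reverse=True)[:n]
```
Max-norm rescaling keeps either model from dominating merely because its raw scores live on a different scale.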
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>We evaluate our experimental results using the evaluation methods provided by the
organizers of ECML PKDD discovery challenge 2009. The training set and the test
set are strictly divided and we use the cleaned dump as our training set for our tag
recommendation system.</p>
        <p>Here are some statistics about the training and test data. The training file contains 1,401,104 tag assignments. 263,004 bookmarks are posted, among which there are 235,328 different url resources, while 158,924 bibtex entries are posted, among which there are 143,050 different publications. From this, we can see that many resources appear just once in the training file. There are 3,617 users and 93,756 tags in total in the training file. The average number of tags posted to a bookmark is 3.48 and the average number of tags posted to a bibtex entry is 3.05.</p>
        <p>In the test data, there are 43,002 tag assignments, 16,898 posted bookmarks and 26,104 posted bibtex entries. Among all the posts, only 1,693 bookmark resources and 2,239 bibtex resources appear in the training file. The average number of tags posted to a bookmark is 3.81 and the average number of tags posted to a bibtex entry is 3.82.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Data Preprocessing</title>
        <p>The training data is provided by the organizers of the ECML PKDD challenge; we establish three tables, bookmark, bibtex and tas, in our MySQL database. In order to compute the similarity between resources, we need to preprocess the url field. For each url in the bookmark table, we eliminate the prefix, such as 'http://', 'https://' or 'ftp://', and then split the url by the character '/'. For example, the url 'http://www.kde.cs.uni-kassel.de/ws/dc09' will be split into 'www.kde.cs.uni-kassel.de', 'ws' and 'dc09'. As mentioned above, we define some information extracted from the tables as the description of a resource r. We eliminate English stop words and stem the remaining words; for example, 'knowledge' is stemmed to 'knowledg', and both 'biology' and 'biologist' are stemmed to 'biologi'.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Results</title>
        <p>In Table 3, we show the performance of the language model on the test data provided by the organizers of the ECML PKDD challenge 2009, reporting results for bookmark, bibtex and the whole data respectively. From the table, we can see that the result for bookmark is better than that for bibtex, and we achieve the highest f-measure of 13.949% when the number of recommended tags is 10.</p>
        <p>In Table 4, we show the performance of the ACT model on the same test data, again for bookmark, bibtex and the whole data respectively. From the table, we can see that the performance of the ACT model is worse than that of the language model. We achieve the highest f-measure of 3.077% when the number of recommended tags is set to five. Also, the performance for bookmark is slightly better than for bibtex; the reason might be that the description in bibtex contains some information that is irrelevant to the main topic of the publication.</p>
        <p>In Table 5, we show the performance after combining the two models. From the table, we can see that after combination our recommendation system works slightly better than the language model alone, with a highest f-measure of 14.398% when recommending five tags. The result after combination is shown in Fig. 1, together with the results of the previous two methods.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we describe our tag recommendation system for the first task of the ECML PKDD Challenge 2009. We exploit two different models to recommend tags. The experimental results show that the language model works much better than the ACT model, and that the combination of the two methods improves the results. However, the poor performance of the ACT model is unexpected: in previous tests using a held-out portion of the training file, the ACT model achieved an f-measure of over 20%.</p>
      <p>We need to further analyze the results to see why the ACT model performs poorly on the test data. We could also try changing the scoring scheme or expanding the candidate set in the language model. Future work also includes new methods of combination.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Marta</given-names>
            <surname>Tatu</surname>
          </string-name>
          , Munirathnam Srikanth, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>D'Silva</surname>
          </string-name>
          . RSDC'08:
          <article-title>Tag Recommendations using Bookmark Content</article-title>
          .
          <source>In Proc. of ECML PKDD Discovery Challenge (RSDC 08)</source>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>107</lpage>
          Marek Lipczak.
          <article-title>Tag Recommendation for Folksonomies Oriented towards Individual Users</article-title>
          .
          <source>In Proc. of ECML PKDD Discovery Challenge(RSDC 08)</source>
          , pp.
          <fpage>84</fpage>
          -
          <lpage>95</lpage>
          Andreas Hotho, Robert Jaschke, Christoph Schmitz, and Gerd Stumme.
          <article-title>BibSonomy: A Social Bookmark and Publication Sharing System</article-title>
          .
          <source>In Proc. the First Conceptual Structures Tool Interoperability Workshop at the 14th Int. Conf. on Conceptual Structures</source>
          , pages
          <fpage>87</fpage>
          -
          <lpage>102</lpage>
          , Aalborg,
          <year>2006</year>
          . Aalborg Universitetsforlag.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Hotho</surname>
          </string-name>
          , Robert Jaschke, Christoph Schmitz, and Gerd Stumme.
          <article-title>FolkRank: A Ranking Algorithm for Folksonomies</article-title>
          .
          <source>In Proc. of FGIR 2006 Robert Jaschke</source>
          , Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and
          <string-name>
            <given-names>Gerd</given-names>
            <surname>Stumme</surname>
          </string-name>
          .
          <article-title>Tag Recommendations in Folksonomies</article-title>
          .
          <source>In Knowledge Discovery in Database: PKDD</source>
          <year>2007</year>
          ,
          <article-title>11th European Conference on Principles and Practice of Knowledge Discovery in Database</article-title>
          , Warsaw, Poland,
          <source>September 17-21</source>
          ,
          <year>2007</year>
          , Proceedings, volume
          <volume>4702</volume>
          <source>of LNCS</source>
          , pages
          <fpage>506</fpage>
          -
          <lpage>514</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Mark</given-names>
            <surname>Steyvers</surname>
          </string-name>
          , Padhraic Smyth, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Griffiths</surname>
          </string-name>
          .
          <article-title>Probabilistic Author-Topic Models for Information Discovery</article-title>
          .
          <source>In Proc. of KDD'04.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jie</given-names>
            <surname>Tang</surname>
          </string-name>
          , Ruoming Jin,
          <string-name>
            <given-names>and Jing</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search</article-title>
          .
          <source>In Proceeding of 2008 IEEE international Conference on Data Mining</source>
          (ICDM'
          <year>2008</year>
          ). pp.
          <fpage>1055</fpage>
          -
          <lpage>1060</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Sanjay C.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sara H.</given-names>
            <surname>Owsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kristian J.</given-names>
            <surname>Hammond</surname>
          </string-name>
          , and Larry Birnbaum.
          <article-title>TagAssist: Automatic tag suggestion for blog posts</article-title>
          .
          <source>In Proc. of the International Conference on Weblogs and Social Media (ICWSM</source>
          <year>2007</year>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Gilad</given-names>
            <surname>Mishne</surname>
          </string-name>
          .
          <article-title>Autotag: a collaborative approach to automated tag assignment for weblog posts</article-title>
          .
          <source>In WWW'06: Proceedings of the 15th international conference on World Wide Web</source>
          , pages
          <fpage>953</fpage>
          -
          <lpage>954</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>