<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discriminative Clustering for Content-Based Tag Recommendation in Social Bookmarking Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Malik Tahir Hassan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asim Karim</string-name>
          <email>akarimg@lums.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suresh Manandhar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Cussens</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science LUMS School of Science and Engineering Lahore</institution>
          ,
          <country country="PK">Pakistan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science The University of York York</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe and evaluate a discriminative clustering approach for content-based tag recommendation in social bookmarking systems. Our approach uses a novel and efficient discriminative clustering method that groups posts based on the textual contents of the posts. The method also generates a ranked list of discriminating terms for each cluster. We apply the clustering method to build two clustering models - one based on the tags assigned to posts and the other based on the content terms of posts. Given a new posting, a ranked list of tags and content terms is determined from the clustering models. The final tag recommendation is based on these ranked lists. If the poster's tagging history is available then this is also utilized in the final tag recommendation. The approach is evaluated on data from BibSonomy, a social bookmarking system. Prediction results show that the tag-based clustering model is more accurate than the termbased clustering model. Combining the predictions from both models is better than either model's predictions. Significant improvement in recommendation is obtained over the baseline method of recommending the most frequent tags for all posts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Social bookmarking systems have become popular in recent years for organizing and
sharing resources on the Web. Such systems allow users to build a database of resources,
typically Web pages and publications, by adding basic information (such as URLs and
titles) about them and by assigning one or more keywords or tags describing them.
The tags serve to organize the resources and help improve recall in searches. Individual
users’ databases are shared among all users of the system enabling the development
of an information repository which is commonly referred to as a folksonomy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A
folksonomy is a collection of users, resources, and tags assigned by a user to a resource
posted by him or her. Tag recommendation for new posts by users is desirable for two
reasons. First, it ensures uniformity of tagging enabling better searches, and second,
it eases the task of users in selecting the most descriptive keywords for tagging the
resource.
      </p>
      <p>Tag recommendation can have one of two goals: (1) to suggest tags tailored to
individual users’ preferences (the ‘local’ goal). and (2) to suggest tags that promote
uniformity in tagging of resources (the ‘global’ goal). Tag recommendation can benefit from
the tagging history of users and resources. However, when a user posts for the first time
and/or the posted resource is new this historical information is less useful. In such cases,
content-based tag recommendation is necessary, in which the contents of the resource
are relied upon for tag recommendation.</p>
      <p>
        This paper addresses task 1 of the ECML PKDD Discovery Challenge 2009 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
This task deals with content-based tag recommendation in BibSonomy, a social
bookmarking system. The goal of tag recommendation is ‘local’, that is, to suggest tags
tailored to individual users’ preferences. Historical data of users, resources, and tags is
available; however, the tag recommendation system must be able to provide good
recommendations for unseen users and/or resources. Thus, the contents of resources must
be utilized for tag recommendation.
      </p>
      <p>Our solution to task 1 of the ECML PKDD Discovery Challenge 2009 relies on a
novel discriminative clustering and term ranking method for textual data. We cluster the
historical data of posted resources and develop a ranked list of discriminating tags and
content terms for each cluster. Given a new posting, based on its contents, we find the
best 3 clusters and develop a weighted list of tags and terms appropriate for tagging the
post. If the poster’s tagging history is available, then this provides a third ranked list of
tags appropriate for the post. The final tag recommendation for the post is done by rules
that select terms from the weighted lists. These rules also decide on the number of tags
to recommend for each known poster. Extensive performance results are presented for
the post-core training data provided by the challenge organizers.</p>
      <p>The rest of the paper is organized as follows. We present the related work and
motivation in Section 2. Section 3 presents details of our content-based tag
recommendation approach, including description of the discriminative clustering and method. Data
preprocessing and analysis is discussed in Section 4. The results of our approach are
presented and discussed in Section 5. We conclude in Section 6.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work and Motivation</title>
      <p>Tagging resources with one or more words or terms is a common way of organizing,
sharing, and indexing information. Tagging has been popularized by Web applications
like image (e.g flickr), video (e.g. YouTube), bookmark (e.g. dec.icio.us), and
publication (e.g. BibSonomy) sharing/organizing systems. Automatic tag recommendation for
these applications can improve the organization of the information through ‘purposeful’
tag recommendations. Moreover, automatic tag recommendations ease the task of users
while posting new resources.</p>
      <p>
        The majority of the approaches proposed for tag recommendation assume that either
the user posting the resource and/or the resource itself has been seen in the historical
data available to the system [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3–6</xref>
        ]. If this is not the case, then only the contents of the
posted resource can be relied upon. For social bookmarking systems, contents of
resources are textual in nature requiring appropriate text and natural language processing
techniques.
      </p>
      <p>
        Content-based tag recommenders for social bookmarking systems have been
proposed by [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Lipczak’s method extracts the terms in the title of a post, expands this
set by using a tag co-occurrence database, and then filters the result by the poster’s
tagging history [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. He reports significant improvements in performance after each step of
this three step process. Tatu et al.’s method utilizes terms from several fields including
URL and title to build post and user based models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] . It relies on natural language
processing to normalize terms from various sets before recommending them. We use
terms from several fields of the posts including URL and title. We also study the impact
of filling in missing and augmenting fields from information crawled from the Web.
      </p>
      <p>
        A key challenge in tag recommendation is dealing with sparsity of information. In
a typical collaborative tagging system, the vast majority of tags are used very
infrequently making learning tagging behavior very difficult. This issue is often sidestepped
in evaluation of tag recommenders when they are evaluated on post-core data with a
high level of duplication (e.g. in [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ] post-core at level 5 is used). Our evaluation is
done on post-core at level 2 data provided by the ECML PKDD Discovery Challenge
2009 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Document clustering has been used extensively for organizing and summarizing
large document collections [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. A useful characteristic of clustering is that it can
handle sparse document spaces by identifying cohesive groups. However, clustering is
generally computationally expensive. In the domain of collaborative tagging systems,
clustering has been explored for information retrieval and post recommendation [
        <xref ref-type="bibr" rid="ref11 ref12">11,
12</xref>
        ]. In this paper, we explore the use of clustering for content-based tag
recommendation. We use an efficient method that is practical for large data sets.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Discriminative Clustering for Content Based Tag</title>
    </sec>
    <sec id="sec-4">
      <title>Recommendation</title>
      <p>Our approach for content-based tag recommendation in social bookmarking systems is
based on discriminative clustering, content terms and tags rankings, and rules for
final recommendations. We use a novel and efficient discriminative clustering method to
group posts based on the tags assigned to them and based on their contents’ terms. This
method maximizes the sum of the discrimination information provided by posts and
outputs a weighted list of discriminating tags and terms for each cluster. We also
maintain a ranked list of tags for seen users. Tags are suggested from these three rankings by
intuitive rules that fuse the information from the lists. The rest of this section presents
our approach in detail.
3.1</p>
      <sec id="sec-4-1">
        <title>Problem Definition and Notation</title>
        <p>
          A social bookmarking system, such as BibSonomy [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], allows users to post and tag two
kind of resources: Web bookmarks and publications. Each resource type is described by
a fixed set of textual fields. A bookmark is described by fields like URL, title, and
description, while a publication is described by fields in the standard bibtex record. Some
of these fields (like title for bookmarks) are mandatory while others are optional. This
textual information forms the content of the resource. Each user who posts a resource
must also assign one or more tags for describing the resource.
        </p>
        <p>Let pi = fui; xi; tig denotes the ith post, where ui is the unique user/poster ID,
and xi and ti are the vector space representations of the post’s contents and tags,
respectively. If T is the size of the vocabulary then the ith post’s contents and tags can
be written as xi = fxi1; xi2; : : : ; xiT g and ti = fti1; ti2; : : : ; tiT g, respectively, where
xij (tij ) denotes the frequency of term j (tag j) in post i. Note that an identical vector
space model is used to represent both content terms and tags, tij 2 f0; 1g; 8i; j, and
xij ¸ 0; 8i; j. The historical data contain N posts. The tag recommender suggests tags
for a new post i described by ui and xi. The user ui and resource described by content
xi may or may not appear in the historical data.</p>
        <p>Let T G(i), T M (i), and T U (i) be the ranked list of tags from clustering, terms
from clustering, and user tags, respectively, corresponding to the ith post. The actual
tags recommended for post i, denoted by T R(i), are determined from these ranked lists
by intuitive rules.</p>
        <p>Given a test data containing M posts, the performance of the tag recommender is
evaluated by averaging F1-score of each prediction over the entire test data.
3.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discriminative Clustering for Tag and Term Ranking</title>
        <p>
          The historical data of N posts is clustered into K ¿ N groups using a novel
discriminative clustering method. This method is motivated from the recently proposed
DTWC algorithm for text classification [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. It is an iterative partitioning method that
maximizes the sum of discrimination information provided by each textual content (a
post, in our setting) between its assigned cluster and the remaining clusters. The key
ideas include discriminative term weighting, discrimination information pooling, and
discriminative assignment. Unlike other partitioning clustering methods, this method
does not require the explicit definition of a similarity measure and a cluster
representative. Furthermore, it builds a ranked list of discriminating terms for each cluster
implicitly. The method is computationally more efficient than popular methods like the
k-means clustering algorithm. We perform two clusterings of the historical data – one
based on the content terms x and the other based on the tags t of the posts in the data.
In the following description, we develop the method for content terms only; the method
as applied to tags will be similar.
        </p>
        <p>
          First, an initial clustering of the data is done. This can be done randomly or, less
efficiently especially for large collections, by a single iteration of the k-means algorithm
with the cosine similarity measure. Given this clustering, a discriminative term weight
wjk is computed for each term j in the vocabulary and for each cluster k as [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
wjk = ½ p(xj jk)=p(xj j:k) when p(xj jk) &gt; p(xj j:k)
        </p>
        <p>p(xj j:k)=p(xj jk) otherwise
where p(xj jk) and p(xj j:k) are the probabilities that term j belongs to cluster k and
the remaining clusters (:k), respectively. The discriminative term weight quantifies
the discrimination information that term j provides for cluster k over the remaining
clusters. Note that this weight is expressed as a probability ratio and is always greater
than or equal to 1. The probabilities are computed by maximum likelihood estimation
from the historical data.</p>
        <p>
          Having computed the discriminative term weights for the current clustering, two
discrimination scores can be computed for each post i. One score, denoted as Scorek(xi),
expresses the discrimination information provided by post i for cluster k, whereas the
other score, denoted as Score:k(xi), expresses the discrimination information
provided by post i for clusters :k. These scores are computed by linearly pooling the
discrimination information provided by each term xj in post i as [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
        </p>
        <p>Scorek(xi) =
Score:k(xi) =</p>
        <p>P
j2Zk xj wjk and</p>
        <p>Pj xj
Pj2Z:k xj wjk</p>
        <p>Pj xj
In these equations, Zk = fjjp(xj jk) &gt; p(xj j:k)g and Z:k = fjjp(xj j:k) &gt;
p(xj jk)g are sets of term indices that vouch for clusters k and :k, respectively. Each
post, described by its contents x, is then reassigned to the cluster k for which the cluster
score f k = Scorek(x) ¡ Score:k(x) is maximum. This is the cluster that makes each
post most discriminating among all the clusters.</p>
        <p>The overall clustering objective is to maximize the sum of discrimination
information, or cluster scores, of all posts. Mathematically, this is written as</p>
        <p>N K
Maximize J = X X Ik(xi) ¢ f k
where Ik(xi) = 1 if post i is assigned to cluster k and zero otherwise. Iterative
reassignment is continued until the change in the clustering objective becomes less than a
specified small value. Typically, the method converges satisfactorily in fewer than 15
iterations.</p>
        <p>The discriminative term weights for the terms in the index set Zk are ranked to
obtain the weighted and ranked list of terms for cluster k. As mentioned earlier, clustering
is also performed based on the tags assigned to posts. This clustering yields another
weighted and ranked list of tags for each cluster.</p>
        <p>It is worthwhile to point out that the term-based clustering is done on both the
training and testing data sets. This approach allows the terms that exist only in the
test data to be included in the vocabulary space, and for such terms to be available for
recommendation as tags.</p>
        <p>Given a new post i described by xi, the best cluster for it is the cluster k for which
the cluster score f k is a maximum. The corresponding ranked list of terms and tags for
post i are denoted by T M (i) and T G(i), respectively. These ranked lists contain the
most discriminating tags for post i based on its contents.
Given a new post, and based on the contents x of the post, two ranked lists of terms
appropriate for tagging are generated by the procedures described in the previous section.
If the user of the post appears in the historical data, then an additional list of potential
tags can be generated. This is the ranked list of tags T U (i) used by the user of post i
The ranking is done based on frequency. Moreover, the average number of tags per user
is computed and used while recommending tags for seen users.</p>
        <p>The final list of tags for post i is made by simple and intuitive rules that combine
information from all the lists. Let S be the number of tags to recommend for post i.
Then, the final list of tags for the post is given by the following algorithm:</p>
        <p>T R(i) = T G(i)[1 : P ] \ T M (i)[1 : Q]</p>
        <p>IF T U (i)j 6= ® THEN T R(i) = T R(i) \ T U (i)[1 : R]</p>
        <p>IF jT R(i)j &lt; S THEN add top terms from T G(i); T M (i)inT R(i)
In the above algorithm, P , Q, and R are integer parameters that define how many top
terms to include from each list. If after taking the set intersections jT R(i)j &lt; S then
the remaining tags are obtained from the top tags and terms in T G(i) and T M (i),
respectively. In general, as seen from our evaluations, R · Q · P , indicating that
T G(i) is the least noisy source and T U (i) the most noisy source for tags.
4
4.1</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation Setup</title>
      <sec id="sec-5-1">
        <title>Data and their Characteristics</title>
        <p>
          We evaluate our approach on data sets made available by the ECML PKDD Discovery
Challenge 2009 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. These data sets are obtained from dumps of public bookmark and
publication posts on BibSonomy [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The dumps are cleaned by removing spammers’
posts and posts from the user dblp (a mirror of the DBLP Computer Science
Bibliography). Furthermore, all characters from tags that are neither numbers nor letters are
removed. UTF-8 encoding and unicode normalization to normal form KC are also
performed.
        </p>
        <p>The post-core at level 2 data is obtained from the cleaned dump (until 31 December
2008) and contain all posts whose user, resource, and tags appear in at least one more
post in the post-core data. The post-core at level 2 contain 64,120 posts (41,268
bookmarks and 22,852 publications), 1,185 distinct users, and 13,276 distinct tags. We use
the first 57,000 posts (in content ID order) for training and the remaining 7,120 posts
for testing.</p>
        <p>We also present results on the test data released as part of task 1 of the ECML
PKDD Discovery Challenge 2009. This data is cleaned and processed as described
above, but it contain only those posts whose user, resource, or tags do not appear in the
post-core at level 2 data. This data contain 43,002 posts (16,898 bookmarks and 26,104
publications) and 1,591 distinct users. For this evaluation, we use the entire 64,120 posts
in the post-core at level 2 for training and test on the 43,002 posts in the test data.</p>
        <p>These data sets are available in the form of 3 tables – tas, bookmark, and bibtex –
as described below. The content of a post is defined by the fields in the bookmark and
bibtex tables, while the tags appear in the tas table.
tas fact table; who attached which tag to which post/content. Fields include: user
(number; user names are anonymized), tag, content id (matches bookmark.content id or
bibtex.content id), content type (1 = bookmark, 2 = bibtex), date
bookmark dimension table for bookmark data. Fields include: content id (matches
tas.content id), url hash (the URL as md5 hash), url, description, extended
description, date
bibtex dimension table for BibTeX data. Fields include: content id (matches tas.content
id), journal, volume, chapter, edition, month, day, booktitle, howPublished,
institution, organization, publisher, address, school, series, bibtexKey (the bibtex key
(in the @... line)), url, type, description, annote, note, pages, bKey (the “key”
field), number, crossref, misc, bibtexAbstract, simhash0 (hash for duplicate
detection within a user – strict – (obsolete)), simhash1 (hash for duplicate detection
among users – sloppy –), simhash2 (hash for duplicate detection within a user –
strict –), entrytype, title, author, editor, year</p>
        <p>A few tagging statistics from the post-core data are given in Table 1 and Figure
1. These statistics are used to fix the parameter S (number of recommended tags) for
known users. For unseen users, S is set at 5.
We explore tag recommendation performance on original contents, contents that have
been augmented by crawled information, and contents that have been augmented and
lemmatized.</p>
        <p>The vocabulary for the vector space representation is formed from the tags and
content terms in the training and testing sets. Selected content fields are used for gathering
the content terms. For bookmark posts, the selected fields are url, description, and
extended. For publication posts, the selected bibtex fields are booktitle, journal,
howpublished, publisher, series, bibtexkey, url, description, annote, note, bkey, crossref, misc,
bibtexAbstract, entrytype, title, and author. As mentioned earlier, the tags, which appear
in the tas table, are also included in the vocabulary.</p>
        <p>We remove all the non-letter and non-digit characters, but retain umlauts and other
non-Latin characters due to UTF-8 encoding. All processed terms of length greater than
or equal to three are retained. The tags are processed similarly, but without considering
the token length constraint.
Crawling Crawling is done to fill in and augment important fields. For bookmark posts,
the extended description field is appended with textual information from &lt;TITLE&gt;,
&lt;H1&gt; and &lt;H2&gt; HTML fields of the URL provided in the posts.</p>
        <p>
          For publication posts, missing abstract field are filled using online search. We use
the publication title to search for its abstract on CiteULike [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. If the article is found,
and its abstract is available on CiteULike, the bibtexAbstract field of the post is
updated. CiteULike is selected because its structure is simpler and it does not have any
restrictions on the number of queries (in a day for example).
        </p>
        <p>
          Lemmatization We also explore lemmatization of the vocabulary while developing
the vector space representation. Lemmatization is different from stemming as
lemmatization returns the base form of a word rather than truncating it. We do lemmatization
using TreeTagger [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. TreeTagger is capable of handling multiple languages besides
English. We lemmatize the vocabulary using English, French, German, Italian, Spanish
and Dutch languages. The procedure, in brief, is as below:
1. TreeTagger is run on the vocabulary file once for each language: English, French,
        </p>
        <p>German, Italian, Spanish and Dutch.
2. TreeTagger returns the output file containing token, pos, lemma. The lemma is
“&lt;unknown&gt;” if a token is not recognized in that language.
3. Using this “&lt;unknown&gt;” word, we combine the output of all six lemmatized files.</p>
        <p>If a term is not recognized by any language, the term itself is used as lemma.
4.3</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation Criteria</title>
        <p>The performance of tag recommendation systems is typically evaluated using precision,
recall, and F1 score, where the F1 score is a single value obtained by combining both
precision and recall. We report the precision, recall, and F1 score averaged over all the
posts in the testing set.
5</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>In this section, we present and discuss the results of our discriminative clustering
approach for content based tag recommendation. We start off by evaluating the
performance of the clustering method.
5.1</p>
      <sec id="sec-6-1">
        <title>Clustering Performance</title>
        <p>The performance of the discriminative clustering method is evaluated on the entire
64,120 posts of the post-core at level 2 data. We cluster these posts based on the tags
assigned to them. After clustering and ranking of tags for each cluster, we recommend
the top 5 tags from the ranked list for all posts in each cluster. The average precision,
recall, and F1 score percentages obtained for different values of K (number of desired
clusters) is shown in Table 2.</p>
        <p>The top 5 tags become increasingly accurate recommendations as the number of
clusters is increased, with the maximum recall of 48.7% and F1 score of 30.6%
obtained when K = 300. These results simulate the scenario when the entire tag space
(containing 13,276 tags) is known. Furthermore, there is no separation between
training and testing data. Nonetheless, the results do highlight the worth of clustering in
grouping related posts that can be tagged similarly.</p>
        <p>Table 3 shows the top ranked tags for selected clusters. It is seen that the
discriminative clustering method is capable of grouping posts and identifying descriptive tags for
/ No. Top Discriminating Tags
1 svm, ki2007webmining, mining, kernels, textmining, dm, textclassification
2 windows, freeware, utility, download, utilities, win, shareware
3 fun, flash, games, game, microfiction, flashfiction, sudden
4 tag, cloud, tagcloud, tags, folksonomia, tagging, vortragmnchen2008
5 library, books, archive, bibliothek, catalog, digital, opac
6 voip, mobile, skype, phone, im, messaging, hones
7 rss, feeds, aggregator, feed, atom, syndication, opml
8 bookmarks, bookmark, tags, bookmarking, delicious, diigo, socialbookmarking
each group of posts. Noisy tags are not ranked high in the lists. It is even able to
discriminate and group posts of different languages (not shown in this table), especially when
clustering is based on content terms. Two valuable characteristics of the discriminative
clustering method are its stability and efficiency. The method converges smoothly
(Figure 2) usually within 15 iteration. More importantly, especially considering the large
post by vocabulary sizes involved, is the efficiency of the method. Each iteration of the
method completes within 3 minutes, even for the large 107; 122 £ 317; 283 data for the
content-based clustering of the post-core plus task 1 test data.
In this section, we discuss the performance of recommending the top 5 tags from the
T G(i) or T M (i) list of each post i. This evaluation is done on the testing data of 7,120
posts held out from the post-core at level 2 data. The clustering model is based on the
first 57,000 posts (in content ID order) from the data. In this evaluation, the original
16 x 104
14
J12
,
e
itv10
c
e
j
b
O 8
g
n
i
tre 6
s
u
lC 4
2
00</p>
        <p>K = 10
K = 50
K = 100
K = 200
K = 300
5
10</p>
        <p>15</p>
        <p>No. of Iterations
data, without augmentation with crawled information, is used for creating the vector
space representation.</p>
        <p>The recommendation results for different K values are given in Table 4. Results are
shown for the case when only the top cluster for each post is considered, and for the
case when the top three clusters of each post are merged in a weighted manner (using
cluster score and discriminative term weights). It is observed that merging the lists of the
top three clusters always gives better performance. Moreover, recommendations based
on T G(i) are always better than those based on T M (i) indicating that the term-based
clustering is more noisy than that based on tags. We also find out that K = 200 yields
the highest recommendation performances.
In this section, we evaluate the performance of our approach when utilizing information
from all lists. We also evaluate performance on original, crawled, and crawled plus
lemmatized data. These results are shown in Table 5. For this evaluation, we fix K =
200 and use the top three clusters for building T G(i) and T M (i).</p>
        <p>The first column (identified by the heading TF) shows the baseline result of
recommending the top 5 most frequent tags in the training data (57,000 posts from post-core
data). It is seen that our clustering based recommendation improves performance
beyond the baseline performance. The second and third columns show the performance of
recommending the top 5 terms from T G(i) and T M (i), respectively. The predictions
of the tag-based clustering always outperform the predictions of the term-based
clustering. In the fourth column, we report results for the case when the top 5 recommended
tags are obtained by combining T G(i) and T M (i), as described in Section 3.3. These
results are significantly better than those produced by each list independently.</p>
        <p>The fifth column shows the results of combining all lists, including the user list
T U (i) when known. This strategy produces the best F1 score of 15.5% for the crawled
data. This is a significant improvement over the baseline F1 score of 7.0%.</p>
        <p>Table 5 also shows that filling in missing fields and augmenting the fields with
crawled information improves performance. Lemmatization does not help, probably
because users do not necessarily assign base forms of words as tags.
We report the performance of our approach on task 1 test data released by the
challenge organizers on the bottom line of Table 5. We filled in missing and augmented
other fields by crawled information. No lemmatization is done. The final vocabulary
size is equal to 317,283 terms making the tag recommendation problem very sparse.
The baseline performance of using the 5 most frequent tags from the post-core at level
2 (the training data for this evaluation) is the F1 score of 1.1% only. By using our
discriminative clustering approach, the average F1 score reaches up to 5.4%. This low
value is attributable to the sparseness of the data, and it is unlikely that other methods
can cope better without extensive semantic normalization and micro modeling of the
tagging process.
6</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, we explore a discriminative clustering approach for content-based tag
recommendation in social bookmarking systems. We perform two clusterings of the posts:
one based on the tags assigned to the posts and the second based on the content terms of
the posts. The clustering method produces ranked lists of tags and terms for each cluster.
The final recommendation is done by using both lists, together with the user’s tagging
history if available. Our approach produces significantly better recommendations than
the baseline recommendation of most frequent tags.</p>
      <p>In the future, we would like to explore language specific models, incorporation of a
tag extractor method, and semantic relatedness and normalization.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Golder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huberman</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          :
          <article-title>The structure of collaborative tagging systems</article-title>
          . http://arxiv.org/abs/cs.DL/0508082 (Aug
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Eisterlehner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Ja¨schke, R.:
          <article-title>Ecml pkdd discovery challenge 2009</article-title>
          . http://www.kde.cs.uni-kassel.de/ws/dc09 (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Sigurbjo¨rnsson, B.,
          <string-name>
            <surname>van</surname>
            <given-names>Zwol</given-names>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          :
          <article-title>Flickr tag recommendation based on collective knowledge</article-title>
          .
          <source>In: WWW '08: Proceeding of the 17th international conference on World Wide Web</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>327</fpage>
          -
          <lpage>336</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Ja¨schke, R.,
          <string-name>
            <surname>Marinho</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
          </string-name>
          , G.:
          <article-title>Tag recommendations in folksonomies</article-title>
          . (
          <year>2007</year>
          )
          <fpage>506</fpage>
          -
          <lpage>514</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jschke</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marinho</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
          </string-name>
          , G.:
          <article-title>Tag recommendations in social bookmarking systems</article-title>
          .
          <source>AI Communications</source>
          <volume>21</volume>
          (
          <issue>4</issue>
          ) (
          <year>2008</year>
          )
          <fpage>231</fpage>
          -
          <lpage>247</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Symeonidis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nanopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manolopoulos</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Tag recommendations based on tensor dimensionality reduction</article-title>
          .
          <source>In: RecSys '08: Proceedings of the 2008 ACM conference on Recommender systems</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>43</fpage>
          -
          <lpage>50</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lipczak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Tag recommendation for folksonomies oriented towards individual users</article-title>
          .
          <source>In: ECML PKDD Discovery Challenge</source>
          <year>2008</year>
          . (
          <year>2008</year>
          )
          <fpage>84</fpage>
          -
          <lpage>95</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tatu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikanth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Silva</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Rsdc 08: Tag recommendation using bookmark content</article-title>
          .
          <source>In: ECML PKDD Discovery Challenge</source>
          <year>2008</year>
          . (
          <year>2008</year>
          )
          <fpage>96</fpage>
          -
          <lpage>107</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Andrews</surname>
            ,
            <given-names>N.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Recent developments in document clustering</article-title>
          .
          <source>Technical report</source>
          , Computer Science, Virginia
          <string-name>
            <surname>Tech</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kogan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicholas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teboulle</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A Survey of Clustering Data Mining Techniques</article-title>
          .
          <source>In: Grouping Multidimensional Data</source>
          . Springer (
          <year>2006</year>
          )
          <fpage>25</fpage>
          -
          <lpage>71</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Begelman</surname>
          </string-name>
          , G.:
          <article-title>Automated tag clustering: Improving search and exploration in the tag space</article-title>
          .
          <source>In: In Proc. of the Collaborative Web Tagging Workshop at WWW06</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Shepitsen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gemmell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mobasher</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burke</surname>
          </string-name>
          , R.:
          <article-title>Personalized recommendation in social tagging systems using hierarchical clustering</article-title>
          .
          <source>In: RecSys '08: Proceedings of the 2008 ACM conference on Recommender systems</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. BibSonomy: Bibsonomy:
          <article-title>A blue social bookmark and publication sharing system</article-title>
          . http://www.bibsonomy.org/ (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Junejo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karim</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A robust discriminative term weighting based linear discriminant method for text classification</article-title>
          .
          <source>In: Data Mining</source>
          ,
          <year>2008</year>
          . ICDM '08. Eighth IEEE International Conference on.
          <source>(Dec</source>
          .
          <year>2008</year>
          )
          <fpage>323</fpage>
          -
          <lpage>332</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. CiteULike:
          <article-title>Citeulike website</article-title>
          . http://www.citeulike.org/ (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. TreeTagger:
          <article-title>Treetagger - a language independent part-of-speech tagger</article-title>
          . http://www.ims.unistuttgart.de/projekte/corplex/TreeTagger/ (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>