Tracking Usage in Collaborative Tagging Communities
         Elizeu Santos-Neto                                  Matei Ripeanu                              Adriana Iamnitchi
  Electrical & Computer Engineering                Electrical & Computer Engineering           Computer Science and Engineering
    University of British Columbia                   University of British Columbia               University of South Florida
     2332 Mail Mall – KAIS 4075                       2356 Mail Mall – KAIS 4033                4202 E. Fowler Ave, Tampa, FL
           +1.604.827.4270                                  +1.604.822.7281                           +1.813.974.5357
    elizeus@ece.ubc.ca                                    matei@ece.ubc.ca                             anda@cse.usf.edu
ABSTRACT                                                                [16] report that in January 2006 Flickr congregated about one
Collaborative tagging has recently attracted the attention of both      million users. Similarly, del.icio.us reached one million users in
industry and academia due to the popularity of content-sharing          September 2006 [26].
systems such as CiteULike, del.icio.us, and Flickr. These systems       Although collaborative tagging is attracting increasing attention
give users the opportunity to add data items and to attach their        from both industry and academia, there are few studies that assess
own metadata (i.e., tags) to stored data. The result is an effective    the characteristics of communities of users who share and tag
content management tool for individual users. Recent studies,           content. In particular, little research has been done on the
however, suggest that, as tagging communities grow, the added           potential benefits of tracking usage patterns in collaborative
content and the metadata become harder to manage due to                 tagging communities. Moreover, recent investigations have shown
increased content diversity. Thus, mechanisms that cope with            that, as the user population grows, the efficiency of information
increase of diversity are fundamental to improve the scalability        retrieval based on user generated tags tends to decrease [2].
and usability of collaborative tagging systems.
                                                                        Mining usage patterns is an efficient method to improve the
This paper analyzes whether usage patterns can be harnessed to          quality of service provided by information retrieval mechanisms
improve navigability in a growing knowledge space. To this end,         in the web context. For example, usage patterns can be harnessed
it presents a characterization of two collaborative tagging             to improve ‘browsing experience’ via recommendation systems or
communities that target the management of scientific literature:        to predict buying patterns and consequently increase revenue of
CiteULike and Bibsonomy. We explore three main directions:              e-commerce operations [8][9][10][17][18][19][20].
First, we analyze the tagging activity distribution across the user
population. Second, we define new metrics for similarity in user        This work is motivated by the following conjecture: usage
interest and use these metrics to uncover the structure of the          patterns can be harnessed to present relevant, contextualized
tagging communities we study. The properties of the structure we        information and deal with the reduced navigability generated by
uncover suggest a clear segmentation of interests into a large          informational overload in large tagging communities.
number of individuals with unique preferences and a core set of         We present encouraging preliminary steps to substantiate the
users with interspersed interests. Finally, we offer preliminary        above conjecture: We characterize two collaborative tagging
results that suggest that the interest-based structure of the tagging   systems: (CiteULike [3] and Bibsonomy [4]) as a first step towards
community can be used to facilitate content retrieval and               a model to represent user interests based on tagging activity. After
navigation as communities scale.                                        introducing related work (Section 2), we present a formal
                                                                        definition for tagging communities (Section 3) and the
Categories and Subject Descriptors                                      communities and the data sets this study explores (Section 4). We
H.1.1 [General]: Systems and Information Theory - Information           then characterize tagging activity distribution among users
Theory. H.3.5 [Information Storage and Retrieval]: On-line              (Section 5) and we investigate the structure of user’s shared
Information Services - Web-based services.                              interests (Section 6). Finally, we present preliminary results on the
                                                                        efficacy of using contextualized attention based on the structure of
                                                                        shared interests to improve the navigability in the system (Section
General Terms                                                           7). Section 8 summarizes our findings and outlines future research
Measurement, Experimentation, Human Factors, Theory.                    directions.

Keywords                                                                2. RELATED WORK
Collaborative Tagging, Usage Patterns, Modeling User Attention,         Two types of techniques, implicit and explicit, are traditionally
CiteULike, Bibsonomy.                                                   used to elicit user preferences in the Web context [1][6][15].
                                                                        Explicit techniques are based on direct input from a user with
1. INTRODUCTION                                                         respect to her preferences and interests (e.g., page rating scales,
Collaborative tagging systems are online communities that allow         item reviews, categories of interest). Implicit techniques infer a
users to assign terms from an uncontrolled vocabulary (i.e., tags)      definition of user interests from her activity, e.g., using client-side
to items of interest. This simple tagging feature proves to be a        or service-side mechanisms such as browser plug-ins, client
powerful mechanism for personal knowledge management (e.g.,             extensions, and server-side logs to track usage patterns. Clearly,
in systems like CiteULike [3]) and content sharing (e.g., in            each technique has its own advantages and limitations in terms of
communities such as Flickr [25]). Recently, collaborative tagging       accuracy, cost to the user, privacy control, or ability to adapt to
systems have attracted massive user communities: Novak et al.           changes on user interests trends.
In a tagging community context, the tags themselves can be                that, to facilitate browsing through tagging systems, it is
interpreted as explicit metadata added by each user. Additionally,        increasingly important to take into account user attention in terms
observed tagging activity including the volume and frequency              of observed tagging activity.
with which items are added, the number of tagged items, or tag
                                                                          Niwa et al. [17] propose a recommendation system based on the
vocabulary size can be harnessed to extract implicit information.
                                                                          affinity between users and tags, and on the explicit site
Due to the youth of collaborative tagging systems, relatively little      preferences expressed by the user. Our study differs from this
work has been done on tracking usage and exploring                        work as we use implicit user profiles and propose the use of
contextualized user attention in these communities. However,              entropy as a metric to characterize their effectiveness.
several studies present techniques and models for collecting and
                                                                          Outside the academic area, a number of projects explore the use of
managing user attention metadata in the wider web context
                                                                          implicitly-gathered user information. We mention Google's
without exploring tagging features [1][6][15]. These techniques
                                                                          initiative to explore users’ past search history to refine the results
include post processing of usage logs, tracking user input (e.g.
                                                                          provided by the Page Rank [8][9]. Commercial interest in
search terms) and eliciting explicit user preferences. Other
                                                                          contextualized user attention highlights that tracking user
investigations are concerned with methods to use contextualized
                                                                          attention and characterizing collective online behavior is not only
attention to improve web search [1][15].
                                                                          an intriguing research topic, but also a potentially attractive
As a first step to modeling user attention in tagging communities,        business opportunity.
it is necessary to characterize collaborative tagging behavior. In
this respect, Golder and Huberman [5] study user activity patterns
                                                                          3. BACKGROUND
regarding system utilization and tag usage in del.icio.us – a social
                                                                          A collaborative tagging community allows users to tag items via a
bookmarking tool that allows users to share and tag URLs. First,
                                                                          web site. Users interact with the website by searching for items,
they observe a low correlation between the number of items in
                                                                          adding new items to the community, or tagging existent items.
each user's bookmark list and the number of tags used by each
                                                                          The tagging action performed by a user is generally referred as a
user. Next, they discuss the models that could explain this lack of
                                                                          tag assignment.
correlation and suggest it is an effect of shared knowledge and
imitation in associating tags. Finally, the authors suggest that the      For example, in CiteULike and Bibsonomy, each user has a
urn model proposed by Eggenberger & Polya [14] is an                      library, i.e., a set of links to scientific publications and books.
appropriate model to derive the evolution of tag usage frequency          Each item in the library is associated with a set of terms (tags)
on a particular item.                                                     assigned by users. It is important to highlight that, in both
                                                                          CiteULike and Bibsonomy, the process of assigning tags to items
The urn model can be formulated as follows: consider an urn that
                                                                          is collaborative, in the sense that all users can inspect other users’
contains two colored balls. Iteratively, a ball is drawn at random
                                                                          libraries and assigned tags. User can thus repeat tags used by
from the urn and returned to the urn together with a new ball of
                                                                          others to mark a particular item. This is unlike other communities
the same color. If this process is repeated a number of times, the
                                                                          (e.g., Flickr) where each user has a fine-grained access control to
fraction of balls of a particular color stabilizes. The interesting
                                                                          define who has permissions to see the content and apply tags to it.
aspect of this model is that if the process is restarted, this fraction
converges to a different number. Golder and Huberman argue that           In CiteULike and Bibsonomy users have two options to add items
this model captures the evolution of tag proportion observed in           to their libraries:
the del.icio.us data set. In studies related to contextualized user
attention, this model may be valuable to predict future user              1. Browse the content of popular scientific literature portals
tagging assignments which can be a useful input to                            (e.g. ACM Portal, IEEE Explorer, arXiv.org), to add
recommendation mechanisms. Golder and Huberman’s study,                       publications to their own library, and
however, is limited in scale: their results on tagging behavior           2. Search for items present in other users' libraries and add
dynamics rely on only four days of tracked activity.                          them to their own library.
Other authors follow different approaches to investigate the              While posting an item, a user can mark it with terms (i.e., tags)
characteristics of tagging systems. Schimtz [10][11] studies              that can be used for future retrieval. The collaborative nature of
structural properties of del.icio.us and Bibsonomy, uses a tri-           tagging relies on the fact that users potentially share interests and
partite hypergraph representation, and adapts the small-world             use similar items and tags. Thus, while the tagging activity of one
pattern definitions to this representation. Cattuto et al. [12] model     user may be self-centered the set of tags used may facilitate the
usage behavior via unipartite projections from a tripartite graph.        job of other users in finding content of interest.
Our approach differs from these studies in terms of scale and in
the use of dynamic metrics to define shared user interest: we             We represent a collaborative tagging community by the tuple:
define metrics that scale as the community grows and/or user              C=(U,I,T,A), where U represents the set of users, I is the set of
activity increases (Section 6).                                           items, T is the tagging vocabulary, and A the set of tag
                                                                          assignments.
By analyzing del.icio.us, Chi and Mytkoswicz [2] find that the
efficiency of social tagging decreases as the communities grow:           The set of tag assignments is denoted by A = {(u, t, p) | u ∈ U, t ∈
that is, tags are becoming less and less descriptive and                  T, p∈ I}. From this definition of tag assignments, we can derive
consequently it becomes harder to find a particular item using            the definition of an individual user, item and tag, as follows:
them. Simultaneously, it becomes harder to find tags that
efficiently mark an item for future retrieval. These results indicate
 A user uk ∈ U is denoted by a pair uk = (Ik, Tk), where Ik is the      approximately 14% of the original data set, while the users
  set of items user k has ever tagged. Thus, an item p ∈ Ik if and       removed from Bibsonomy are around 0.6% of the original data
                                                                         set. Table 1 summarizes the characteristics of each data set after
  only if ∃ (uk,t, p) ∈ A, for any t ∈ Tk. Similarly, Tk is the set of   the data cleaning operation.
  tags user uk applied before, where t ∈ Tk if and only if ∃ (uk, t,
  p) ∈ A.                                                                5. TAGGING ACTIVITY
                                                                         To gain an understanding on the usage patterns in these two
 An item pi ∈ I is denoted by pi = (Ui , Ti ), where Ui is the set of   communities, we start by evaluating the activity levels along
  users who tagged this item, and Ti is set of tags this item has        several metrics: the number of items per user, number of tagging
  received.                                                              assignments performed, and number of tags used. The question
                                                                         answered in this section is the following:
 A tag tj ∈ T is denoted by tj = (Uj , Ij ), where Uj is the set of
  users who used the tag tj before, and Ik is the set of items           Q1: How is the tagging activity distributed among users?
  annotated with the tag tj.
                                                                         We aim to quantify the volume of user interaction with the
                                                                         system, either by adding new content to the community, or by
4. DATA SETS AND DATA CLEANING                                           tagging an existing item. Intuitively, one would expect that a few
Both tagging communities we analyze: CiteULike [3] and                   users are very active while the majority rarely interacts with the
Bibsonomy [4], aim to improve user’s organization and                    community.
management of research publications. Both provide functionality
                                                                         Determining how often users perform tag assignments is
to import and export citation records in formats like BibTeX, for
                                                                         important to help designing systems that track user attention. For
example.
                                                                         example, in a context where activity information is used to
The data sets analyzed in this article were provided by the              recommend new items based on tag similarity, it can be necessary
administrators of the respective web sites. Thus, the data               to compute the similarity level at the same rate as the rate with
represents a global snapshot of each system within the period            which new information is added into the system. Figure 1 presents
determined by the timestamps in the traces we have obtained              the user rank according to the number of tag assignments
(Table 1). It is important to point out that the Bibsonomy data set      performed during the time frame of our data set. In the results that
has timestamps starting at 1995, which we considered a bug.              follow, we present the data points observed together with a curve
Moreover, Bibsonomy has two separate datasets, scientific                that provides a good model to the observed data (i.e. Hoerl
literature and URL bookmarks. We concentrated our analysis on            function [21]). At the end of this section we comment more on the
the scientific literature part of the data.                              characteristics of this curve.
In the original CiteULike data set, the most popular tag is “bibtex-
import” while the second most popular tag is “no-tag”,
automatically assigned when a user does not assign any tag to a
new item. The popularity of these two tags indicates that a large
part of users use CiteULike as a tool to convert their list of
citations to BibTex format, and that users tend not to tag items at
the time they post a new item to their individual libraries. Clearly,
this is relevant information for system designers who might want
to invest effort in improving the features of most interest.
                                                                          Figure 1: User rank based on the number of tag assignments.
Also, in CiteULike one user posted and tagged more than 3,000                       Note the logarithmic scales on both axes.
items within approximately 5 minutes (according to the                   A second metric for tagging activity is the size of user libraries.
timestamps in the data set). Obviously, this behavior is due to an       Figure 2 plots user library size for users ranked in decreasing
automatic mechanism.                                                     order according to the size of their libraries for CiteULike and
   Table 1: Summary of cleaned data sets used in this study              Bibsonomy, respectively. This shows the size of the set of items a
                                                                         particular user pays attention to. The results confirm that the users
                         CiteULike                 Bibsonomy             in these two systems are heterogeneous in terms of activity
 Period              11/2004—04/2006              ??—12/2006             intensity, as it has already been indicated by the tag assignment
 # Users (|U|)             5,954                      656                activity.
 # Items (|I|)            199,512                    67,034
 # Tags (|T|)             51,079                     21,221
 # Assignments (|A|)      451,980                   257,261
Our objective is to concentrate only on those users who are using
the system interactively to bookmark and share articles.
Consequently, for the analysis that follows, we have the “robot”
user (i.e., a user with 3,000 items tagged within 5 minutes) and
users who used only the tags bibtex-import and/or no-tag. The
total number of users removed from CiteULike represents                           Figure 2: User rank based on the library size.
The correlation between a user’s library size and her vocabulary is      collaborative tagging community, one may draw a comparison
important to understand whether the diversity of the vocabulary          between the potential diversity found in the users' library
used by each user grows with the number of items in her personal         regarding the number of items in it, and the bio-diversity
library. We observe that, in both communities the users’ library         distribution across geographic regions.
and vocabulary sizes are strongly correlated for CiteULike
                                                                         Although a Hoerl function is a good fit for the activity
(R2 = 0.98, n = 5954) and less strongly, but still positively
                                                                         distributions, this does not directly imply that diversity of user
correlated, for Bibsonomy (R2 = 0.80, n = 654). Although such
                                                                         libraries or vocabularies represents a phenomenon which is
correlation may seem intuitive, since users with a more diverse set
                                                                         similar to those presented by studies on biodiversity.
of items would need more tags to describe them, this behavior is
                                                                         Nevertheless, the Hoerl function does provide a good model for
different from that observed by Golder and Huberman in
                                                                         collaborative tagging activity and it can be useful to study user
del.icio.us [5]. A possible explanation is that in del.icio.us user is
                                                                         diversity in collaborative tagging systems in the future.
presented with tag suggestions based on past tagging activity
when adding a new bookmark. These suggestions may bias and               To summarize: in the communities we study, the intensity of user
limit the size of a user vocabulary. However, further investigation      activity is distributed over multiple orders of magnitude, it is well
is necessary to assess how a user vocabulary is affected by tagging      modeled using the Hoerl function and, unlike in other
recommendation.                                                          communities, there is a strong correlation in activity in terms of
                                                                         items set and vocabulary sizes.

                                                                         6. EVALUATING USER SIMILARITY
                                                                         While the analysis above is important for an overall usage profile
                                                                         evaluation of each community, it provides little information about
                                                                         user interests. Assessing the commonality in user interests is
                                                                         important for identifying user groups that may form around
                                                                         content of common interest. Thus, a natural set of questions that
             Figure 3: User rank by vocabulary size                      we aim to answer in this section are:

A second finding is that the tagging activity (i.e., number of           Q2: Is the tagging community segmented into several sub-
tagging assignments) and library size per user are strongly              communities with different interests? Do users cluster around
correlated for both communities (with R2 above 0.97) while the           particular items and tags?
correlations between the tagging activity and the vocabulary size        To address these questions, we define the interest-sharing graph
is strong for CiteULike (R2 = 0.99), but weaker for Bibsonomy            after the intuition of data-sharing graphs introduced by Iamnitchi
(R2 = 0.67).                                                             et al. [27]. An interest-sharing graph captures the commonality in
A third finding is that tagging activity distributions are not well      user interest for an entire user population: Intuitively, users are
modeled by a Zipf-like distribution. Instead, a Hoerl model [21]         connected in the interest-sharing graph if they focus on the same
that extends the power-law family and it is defined by Equation          subset of items and/or speak similar language (i.e., share a subset
(1) fits better:                                                         of tags).

                            f(x) = abxxc                   (1)           More formally, consider a graph G = (U, E) where nodes are users
                                                                         and edges represent the existence of shared interests or activity
Table 2 contains the Hoerl parameters a, b, and c determined via a       similarity between users. The rest of this study explores three
curve fitting process for each of the ranking distributions              possible definitions for user interest or activity similarity. All
observed.                                                                these definitions employ a threshold t for the percentage of items
    Table 2: Coefficients determined for the Hoerl function              or tags shared between two users:
CiteULike                     a               b               c          1) The User-Item similarity definition considers two users’
                                                                            interests similar if the ratio between the sizes of the
Tag Assignments           9,767.13         0.9979         -0.4754
                                                                            intersection and the size of the union of their item libraries is
Library Size              2,609.77         0.9988         -0.4772           larger than a threshold t. This is expressed by Equation 2.
Vocabulary Size           3,338.55         0.9992         -0.5964
Bibsonomy
                                                                                                      Ik ∩ I j
                                                                                      ekj ∈ E ⇔                                User-Item (2)
Tag Assignments          28,969.29         0.9864         -0.6888                                     Ik ∪ I j
Library Size              6,137.49         0.9850         -0.5461
                                                                         2) The User-Tag definition is similar to the definition above but
Vocabulary Size           2,608.45         0.9907         -0.5126            considers the vocabularies of the two users rather than their
                                                                             libraries.
Similar to the Zipf distribution, the Hoerl function has been used
to model a large number of natural phenomena. The most relevant
to collaborative tagging is the use of Hoerl function to describe
                                                                                                     Tk ∩ T j
the distribution of bio-diversity across a geographic region
                                                                                      ekj ∈ E ⇔                                User-Tag (3)
[22][24]. Considering each user's library a region in a
                                                                                                     Tk ∪ T j
3) Unlike the User-Item definition in Equation 2 above, the           increases (Note that we exclude isolated nodes from this count of
   Directed User-Item considers two users’ interests similar if       connected graph components).
   the ratio between the intersection of their item libraries and
                                                                      The plots in Figure 4 show that the number of connected
   the size of one user library is larger than a threshold t. The
                                                                      components increases up to a certain value of our similarity
   idea is to explore the role played by users with large libraries
   via the introduction of direction to the edges in the graph.       threshold. After a certain value of t, the number of connected
                                                                      components in the graph starts decreasing, since more and more
                                                                      connected components will contain only one node and will thus
                            Ik ∩ I j                                  be excluded. The critical threshold value is different for each user
             ekj ∈ E ⇔                     Directed-User-Item (4)
                                                                      similarity definition.
                               Ik
In our analysis of real tag assignment traces from the two tagging
communities, even with low values for the sharing ratio threshold
t, the final graph contains a large number of isolated nodes.
Indeed, by setting the threshold as low as one single item (i.e.,
two users are connected if they share at least one item); we find
that, in CiteULike, 2,672 users (44.87%) are not connected to any
other user. This suggests that a large population of users has
individual preferences.


                                                                      Figure 5: Total number of nodes in the interest sharing graph
                                                                          and in the largest component for CiteULike (top) and
                                                                                           Bibsonomy (bottom)

                                                                      The initial increase in the number of connected components can
                                                                      be explained by the fact that, as the threshold increases, large
                                                                      components split to form new islands. Since these islands form
                                                                      naturally based on user similarity this result is encouraging since
  Figure 4: Number of connected components for CiteULike              it offers the potential to cluster users according to their interests.
               (top) and Bibsonomy (bottom)                           As t continues to increase the definition of similarity becomes too
                                                                      strict and leads to more and more isolated nodes.
Figure 4 presents, for the three similarity metrics defined above,
                                                                      Two observations about the results in Figure 4 can be noted: first,
the number of connected components for both CiteULike and
                                                                      as the threshold increases, the number of components decreases
Bibsonomy, for thresholds t varying from 1% to 99%. These
                                                                      faster for the User-Item graph (Equation 2) than for the Directed-
results show that regardless of the graph definition the number of
                                                                      User-Item graph (Equation 4). This illustrates the effect of using
connected components follow a similar trend as the threshold
                                                                      an asymmetrical definition for shared interests. The idea explored
here is similar to the friendship graph, where connections               distribution gets closer to a uniform random distribution) and,
between two nodes are not reciprocal (e.g., A might consider B to        second, as the set becomes larger.
be his friend, but, at the same time, B is indifferent to A) [23].
Similarly, in the directed User-Item graph the size of the user data
set is considered leading to an asymmetrical definition of user
interest (e.g., if all items of A data set are included in B’s, then A
will consider she has shared interests with B, while, depending of
the overall size of his item set B might not consider his interests
are similar enough to be connected to A’s).
Second, there are more isolated nodes (i.e., zero-degree nodes) in
the User-Item and Directed-User-Item graphs than in the User-Tag
graph (defined by Equation 3). This indicates that users tend to
have larger overlaps among their vocabularies than among their
libraries. One reason for this observed interest sharing pattern may
be the fact that users browse and consume items from other users'
libraries without necessarily adding those items to their personal
libraries in the system. An approach to verify this observation is to
perform an analysis of user browsing histories to determine how            Figure 6: Entropy growth considering items in CiteULike
often users download items from others’ libraries without adding         In practical terms, in a collaborative tagging community, the
them to their personal set of items. From a system design                increase in entropy of an item set means that the user needs to
perspective, knowing that users are more likely to share tags than       filter out more items to find the one she is interested in. Similarly,
data items may be useful for designing item recommendation               high entropy makes it harder to find a tag that describes an item
heuristics based on vocabulary overlap.                                  well. Conversely, lower entropy makes it potentially easier for a
                                                                         user to reach an item of interest. Thus, the question to be
 All the similarity definitions above generally divide the original
                                                                         answered in this section is the following:
graph into one giant component, several tiny components, and a
large number of isolated nodes. Figure 5 presents the total number       Q3: Can the interest-sharing graph be used to reduce the entropy
of nodes in the components with at least two nodes and the               perceived by users when navigating through the system?
number of nodes in the largest connected component for
thresholds varying from 1% to 99% for the three similarity               Our two-part answer is briefly presented below and detailed in the
measures defined above.                                                  rest of this section. First, we demonstrate that the interest-sharing
                                                                         graph can be used to reduce the entropy perceived by users. To
The results presented in this section demonstrate that using a           this end we define a user’s neighborhood as its set of neighbors in
similarity metric and the resulting interest-sharing graph it is         the sharing graph and show that this construction can be used to
possible to segment the user population according to manifested          present users with an item set with low entropy.
interest. Based on this intuition, we conjecture that it is possible
to build tag/item recommendation mechanisms that exploit usage           Second, we offer preliminary results that suggest that this
patterns, i.e., the shared interests among users. The next section       segmentation of the user population based on neighborhoods in
offers a preliminary analysis of this hypothesis.                        the interest-sharing graph has a good predictive power: the items
                                                                         consumed by a user’s neighbors predict well the future
                                                                         consumption pattern of that user. Thus, this offers a path to build
7. IMPROVING NAVIGABILITY                                                recommendations systems based on the interest-sharing graph.
Chi and Mytcowicz [2] report that navigability, defined as users’
ability to find relevant content, decreases as a tagging community       The rest of this section presents the above two-step exploration in
grows. More precisely, Chi and Mytcowicz imply that the                  detail. We first define a user’s ‘neighborhood entropy’ as the
decrease in navigability is due to an increase in diversity in the set   entropy of the union of the item sets of that user’s neighbors in
of items, users, and tags.                                               the interest-sharing graph. To demonstrate that our group
                                                                         selection technique is effective in reducing entropy, Figure 7
We have verified that the diversity of the data present in the two       compares the average neighborhood entropy in the interest-
communities we study grows over time. Figure 6 presents the              sharing graph with two types of other constructions. All results
evolution of entropy, as the metric to evaluate item diversity, for      are reported with 95% confidence intervals.
the CiteULike data since November 2004.
                                                                         Firstly, we compare for CiteULike, the average neighborhood
As a reminder, we note that the entropy of a set of items I, where       entropy in the largest connected component of an interest sharing
|I| = N, is defined as:                                                  graph (Average Entropy – Figure 7), which is built by using the
                                       N                                 real tagging activity trace, to the total entropy in the system
                            H ( I ) = −∑ pi ⋅ log( pi )    (5)           (almost double at 11.6 as presented in Figure 6). Similarly, the
                                       i =1                              neighborhood entropy in the largest connected component is
where pi is the popularity of an item i.                                 compared to the total entropy in the same component (Largest
                                                                         Component – Figure 7). Secondly, we compare with two random
The entropy of a set may increase in two circumstances: First, as        graph constructions: the entropy of a random graph, which is built
the randomness in the item set increases (i.e., the popularity           by selecting random neighbors from the set of users in the largest
connected component (Random Component – Figure 7); the                  8. CONCLUSIONS & FUTURE WORK
second random graph is built by selecting random from the entire        This work presents a characterization of two collaborative tagging
user set (Random Graph - Figure 7).                                     systems, CiteULike and Bibsonomy, as a first step to help extract
                                                                        implicit information about user attention in collaborative tagging
                                                                        systems.
                                                                        First, we analyze the distribution of tagging activity, i.e., the
                                                                        distribution of the volume of items, tags, and tagging actions
                                                                        related to each user’ activity in the tagging community. We find
                                                                        that the activity distribution is highly heterogeneous along all
                                                                        these multiple axes: a few active users contribute with a large
                                                                        number of tag assignments and maintain a large number of items
                                                                        and tags, while the majority of users have a modest tagging
                                                                        activity.
                                                                        Additionally, users with large libraries tend to have a large
                                                                        vocabulary of tags. While this may seem intuitive, this is not the
                                                                        norm across all tagging systems. In del.icio.us, for example user’s
                                                                        library size and number of tags used are uncorrelated.
                                                                        Second, we define the interest-sharing graph and investigate
                                                                        several definitions for interest similarity based on user activity in
Figure 7: The neighborhood entropy for several sharing ratio            terms of items and vocabularies employed. Our main findings can
  thresholds in CiteULike using User-Item similarity metric             be summarized as follows:
                       (Equation 2).                                    1.   Both communities present a large population of isolated
We observe that the average ‘neighborhood entropy’ in the                    users (zero-degree nodes in the interest-sharing graph). This
interest sharing graph is lower for almost all thresholds analyzed:          indicates that there are a large number of users with unique
First, the ‘neighborhood entropy’ in the interest sharing graph is           preferences. On the other hand, by introducing direction in
between one half and one tenth of the entropy of the entire                  the graph of shared interests, it is possible to reduce the
dataset. Additionally, for all thresholds analyzed except for the            number of isolated nodes. The final main directed connected
1% threshold that does not offer enough discrimination the                   component contains approximately twice more nodes than
‘neighborhood entropy’ is significantly (24% to 400%) lower than             the undirected one.
that of similar random constructions.
                                                                        2.   The structural analysis reveals the existence of a significant
To support our hypothesis that the interest-sharing graph is a good          number of small sub-communities of interests totally
basis to develop recommendation systems, we analyze how                      separated from each other.
efficient the neighbor’s item set in predicting future user attention
over items. To this end, we evaluate the hit ratio: the proportion      3.   The structure of the interest-sharing graph can be used to
of items a user adds to her library at time T+1 that are already in          reduce the diversity of the items a user is exposed to. To
her neighbor’ libraries at time T.                                           quantify this reduction we compare the entropy of the
                                                                             ‘neighborhood item set’ with that of the item set of the entire
To evaluate the hit ratio, we considered the interest-sharing graph          CiteULike/Bibsonomy item set, that of the main connected
based on the User-Item similarity metric with 1% sharing ratio               component, and that of various constructions of random item
threshold. Preliminary results show that depending on the                    sets of similar sizes.
granularity considered (that is the length of our forecasting
period: interval between T and T+1) the hit rate is as high as 20%      4.   Finally, we provide preliminary evidence that suggests that
for one hour granularity and decays to a low of 5% for a                     user’s activity can be predicted by considering the union of
one-month forecast granularity. This indicates that a user’s                 the item sets of a node’s neighbors in the interest sharing
neighborhood is a possible source of information to predict near             graph. We conjecture that this property can be used to build
future user attention and its predictive effectiveness decreases for         efficient, online recommendation systems for tagging
longer time intervals.                                                       communities.

The findings presented on this section can be summarized as             This work inspires a fresh set of questions on collaborative
follows: First, we verify that the entropy increases as the             tagging communities and contextualized attention.
CiteULike community grows overtime; Second, we verify that the          A possible approach to implicitly extract user attention is to
interest-sharing graph can be used to reduce the entropy of item        answer the following question: What are the patterns of
sets presented to users; and finally, we find that a user’s             information propagation in collaborative tagging systems? An
‘neighborhood’ in the interest-sharing graph is a possible source       answer to this question could build on previous work on
of information to predict future user actions. While the                information diffusion in the blogspace [7] by exploring the
preliminary data we present here is encouraging we continue tour        evolution of user attention overtime.
exploration for a complete analysis for diverse interest-sharing
graphs based on multiple similarity metrics and thresholds.             A second intriguing issue to explore is the following How
                                                                        malicious behavior affects a tagging system and whether it be
automatically detected? Search results that are manipulated by        [12] Cattuto, C., Loreto, V. and Pietronero, L. "Semiotic
tagging misbehavior can have an impact on usage in a                       dynamics and collaborative tagging". In PNAS, vol. 104,
collaborative tagging community [13]. Automatic detection of               no. 5, 1461–1464, January, 2007.
malicious users is paramount to the long term survival of these       [13] Koutrika, G., Effendi, F., Gyongyi, Z., Heymann, P., Garcia-
communities.                                                               Molina, H. “Combating Tag Spam”. In Proceedings of
Finally, exploring the structure of user interest overtime can help        AirWeb Workshop, 2007.
devising models for the formation and evolution of the user           [14] Eggenberger, F. and Polya, G. (1923). “Uber die Statistik
similarity graphs, similarly to the study by Kumar et al. [16] on          verketteter Vorgange”. Zeitschrift fur Angewandte
online social networks.                                                    Mathemathik und Mechanik, 3, 279—289.
                                                                      [15] Pitkow, J., Schutze, H., Todd, C., Cooley, R., Turnbull, D.,
ACKNOWLEDGMENTS                                                            Edmonds, A., Adar, E., Breuel, T. “Personalized search”.
The authors would like to thank Richard Cameron for providing              Communications of the ACM (The consumer side of search),
the CiteULike data set; Christoph Schmitz for providing the                Volume 45 (9), 50–55, 2002.
Bibsonomy data set; professor Lee Iverson for insightful
                                                                      [16] Novak, J., Kumar, R. and Tomkins, A. “Structure and
discussions on early stages of this work, and Armin
                                                                           Evolution of On Line Social Networks”. In Proceedings of
Bahramshahry, Samer Al Kiswany and Nazareno Andrade for
                                                                           12th ACM SIGKDD, poster, 2006.
their valuable comments. The graph analysis was executed in
parallel using OurGrid (http://www.ourgrid.org).                      [17] Niwa, S., Doi, T. and Honiden, S. “Web Page Recommender
                                                                           System based on Folksonomy Mining”. In Proceedings of the
                                                                           3rd International Conference on Information Technology:
REFERENCES                                                                 New Generations, 2006.
[1] Lawrence, S. “Context in Web Search”, IEEE Data
    Engineering Bulletin, Volume 23, Number 3, pp. 25–32,             [18] Li, J. and Zaiane, O. “Combining Usage, Content and
    2000.                                                                  Structure Data to Improve Web Site Recommendation”. In
                                                                           Proceedings of WebKDD-2004 Workshop on Web Mining
[2] Chi, E. H. and Mytkowicz, T. “Understanding Navigability               and Web Usage, 2004.
    of Social Tagging Systems”. In Proceedings of CHI'07,
    February, 2007.                                                   [19] Kazienko, P. and Kiwera, M. “Integration of Relational
                                                                           Databases and Web Site Content for Product and Page
[3] CiteULike. http://www.citeulike.org, 2007.                             Recommendation”. In Proceedings of Database Engineering
[4] Bibsonomy. http://www.bibsonomy.org, 2007.                             and Applications Symposium, 2004.
[5] Golder, S. and Huberman, B. “The Structure of                     [20] Mathes,    A.  “Evaluating  Collaborative     filtering
    Collaborative Tagging Systems”. Journal of Information                 Recommender Systems”. ACM Transactions on Information
    Science, Vol. 32, No. 2, 198-208 (2006)..                              Systems, 2004.
[6] Najjar, J., Wolpers, M., and Duval, E. L. “Attention              [21] Hoerl, A. E., “Fitting Curves to Data”. In Chemical
    Metadata: Collection and Management”. In Proceedings of                Business Handbook, Perry, J. (ed), Section 20, pp. 20--55,
    Workshop on Logging Traces of Web Activity: The                        1954.
    Mechanics of Data Collection, Edinburgh, Scotland, May 23,        [22] Lambshead, P. J. D., Brown, C. J., Ferrero, T. J, Hawkins, L.
    2006.                                                                  E., Smith, C. R and Mitchell, N. J. “Biodiversity of
[7] Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.                     nematode assemblages from region of the Clarion-
    “Information Diffusion through Blogspace”. Proceedings of              Clipperton Fracture Zone”. In BioMed Central Ecology,
    the 13th International Conference on World Wide Web,                   3:1, 2003.
    Pages: 491 – 501, 2004.                                           [23] Doreian, P., Batagelj, V. and Ferligoj, A. “Generalized
[8] Zamir, O. E., Korn, J. L., Fikes, A. B., Lawrence, S. R.               Blockmodeling”, Cambridge University Press, 2005.
    “Personalization of placed content ordering in search             [24] Strong, Jr., D. R., McCoy, E. D. and Rey, J. R. “Time and
    results”. US-Patent Application 20050240580.                           the Number of Herbivore Species: The Pests of Sugarcane”.
[9] Haveliwala, T., Jeh, G. M., Kamvar, S. D. “Results based               In Ecology, Vol. 58, No. 1. (Jan., 1977), pp. 167-175.
    personalization of advertisements in a search engine”. US         [25] Flickr. http://www.flickr.com, 2007.
    Patent Application 20050222989.
                                                                      [26] Del.icio.us Blog. http://blog.del.icio.us. September, 2006.
[10] Schimitz, C. "Small World Folksonomies: Clustering in Tri-
     Partite Hypergraphs". Univ. Kassel Tech Report, Nov/06.          [27] Iamnitchi, A., Ripeanu, M. and Foster, I..“Small-World File-
                                                                           Sharing Communities”, In INFOCOM, 2004
[11] Watts D. J, Strogatz S. H. "Collective Dynamics of 'small-
     world' networks". In Nature, vol. 393, June, 1998.