=Paper=
{{Paper
|id=Vol-266/paper-3
|storemode=property
|title=Tracking User Attention in Collaborative Tagging Communities
|pdfUrl=https://ceur-ws.org/Vol-266/paper03.pdf
|volume=Vol-266
|dblpUrl=https://dblp.org/rec/conf/jcdl/Santos-NetoRI07
}}
==Tracking User Attention in Collaborative Tagging Communities==
Tracking Usage in Collaborative Tagging Communities
Elizeu Santos-Neto Matei Ripeanu Adriana Iamnitchi
Electrical & Computer Engineering Electrical & Computer Engineering Computer Science and Engineering
University of British Columbia University of British Columbia University of South Florida
2332 Mail Mall – KAIS 4075 2356 Mail Mall – KAIS 4033 4202 E. Fowler Ave, Tampa, FL
+1.604.827.4270 +1.604.822.7281 +1.813.974.5357
elizeus@ece.ubc.ca matei@ece.ubc.ca anda@cse.usf.edu
ABSTRACT [16] report that in January 2006 Flickr congregated about one
Collaborative tagging has recently attracted the attention of both million users. Similarly, del.icio.us reached one million users in
industry and academia due to the popularity of content-sharing September 2006 [26].
systems such as CiteULike, del.icio.us, and Flickr. These systems Although collaborative tagging is attracting increasing attention
give users the opportunity to add data items and to attach their from both industry and academia, there are few studies that assess
own metadata (i.e., tags) to stored data. The result is an effective the characteristics of communities of users who share and tag
content management tool for individual users. Recent studies, content. In particular, little research has been done on the
however, suggest that, as tagging communities grow, the added potential benefits of tracking usage patterns in collaborative
content and the metadata become harder to manage due to tagging communities. Moreover, recent investigations have shown
increased content diversity. Thus, mechanisms that cope with that, as the user population grows, the efficiency of information
increase of diversity are fundamental to improve the scalability retrieval based on user generated tags tends to decrease [2].
and usability of collaborative tagging systems.
Mining usage patterns is an efficient method to improve the
This paper analyzes whether usage patterns can be harnessed to quality of service provided by information retrieval mechanisms
improve navigability in a growing knowledge space. To this end, in the web context. For example, usage patterns can be harnessed
it presents a characterization of two collaborative tagging to improve ‘browsing experience’ via recommendation systems or
communities that target the management of scientific literature: to predict buying patterns and consequently increase revenue of
CiteULike and Bibsonomy. We explore three main directions: e-commerce operations [8][9][10][17][18][19][20].
First, we analyze the tagging activity distribution across the user
population. Second, we define new metrics for similarity in user This work is motivated by the following conjecture: usage
interest and use these metrics to uncover the structure of the patterns can be harnessed to present relevant, contextualized
tagging communities we study. The properties of the structure we information and deal with the reduced navigability generated by
uncover suggest a clear segmentation of interests into a large informational overload in large tagging communities.
number of individuals with unique preferences and a core set of We present encouraging preliminary steps to substantiate the
users with interspersed interests. Finally, we offer preliminary above conjecture: We characterize two collaborative tagging
results that suggest that the interest-based structure of the tagging systems: (CiteULike [3] and Bibsonomy [4]) as a first step towards
community can be used to facilitate content retrieval and a model to represent user interests based on tagging activity. After
navigation as communities scale. introducing related work (Section 2), we present a formal
definition for tagging communities (Section 3) and the
Categories and Subject Descriptors communities and the data sets this study explores (Section 4). We
H.1.1 [General]: Systems and Information Theory - Information then characterize tagging activity distribution among users
Theory. H.3.5 [Information Storage and Retrieval]: On-line (Section 5) and we investigate the structure of user’s shared
Information Services - Web-based services. interests (Section 6). Finally, we present preliminary results on the
efficacy of using contextualized attention based on the structure of
shared interests to improve the navigability in the system (Section
General Terms 7). Section 8 summarizes our findings and outlines future research
Measurement, Experimentation, Human Factors, Theory. directions.
Keywords 2. RELATED WORK
Collaborative Tagging, Usage Patterns, Modeling User Attention, Two types of techniques, implicit and explicit, are traditionally
CiteULike, Bibsonomy. used to elicit user preferences in the Web context [1][6][15].
Explicit techniques are based on direct input from a user with
1. INTRODUCTION respect to her preferences and interests (e.g., page rating scales,
Collaborative tagging systems are online communities that allow item reviews, categories of interest). Implicit techniques infer a
users to assign terms from an uncontrolled vocabulary (i.e., tags) definition of user interests from her activity, e.g., using client-side
to items of interest. This simple tagging feature proves to be a or service-side mechanisms such as browser plug-ins, client
powerful mechanism for personal knowledge management (e.g., extensions, and server-side logs to track usage patterns. Clearly,
in systems like CiteULike [3]) and content sharing (e.g., in each technique has its own advantages and limitations in terms of
communities such as Flickr [25]). Recently, collaborative tagging accuracy, cost to the user, privacy control, or ability to adapt to
systems have attracted massive user communities: Novak et al. changes on user interests trends.
In a tagging community context, the tags themselves can be that, to facilitate browsing through tagging systems, it is
interpreted as explicit metadata added by each user. Additionally, increasingly important to take into account user attention in terms
observed tagging activity including the volume and frequency of observed tagging activity.
with which items are added, the number of tagged items, or tag
Niwa et al. [17] propose a recommendation system based on the
vocabulary size can be harnessed to extract implicit information.
affinity between users and tags, and on the explicit site
Due to the youth of collaborative tagging systems, relatively little preferences expressed by the user. Our study differs from this
work has been done on tracking usage and exploring work as we use implicit user profiles and propose the use of
contextualized user attention in these communities. However, entropy as a metric to characterize their effectiveness.
several studies present techniques and models for collecting and
Outside the academic area, a number of projects explore the use of
managing user attention metadata in the wider web context
implicitly-gathered user information. We mention Google's
without exploring tagging features [1][6][15]. These techniques
initiative to explore users’ past search history to refine the results
include post processing of usage logs, tracking user input (e.g.
provided by the Page Rank [8][9]. Commercial interest in
search terms) and eliciting explicit user preferences. Other
contextualized user attention highlights that tracking user
investigations are concerned with methods to use contextualized
attention and characterizing collective online behavior is not only
attention to improve web search [1][15].
an intriguing research topic, but also a potentially attractive
As a first step to modeling user attention in tagging communities, business opportunity.
it is necessary to characterize collaborative tagging behavior. In
this respect, Golder and Huberman [5] study user activity patterns
3. BACKGROUND
regarding system utilization and tag usage in del.icio.us – a social
A collaborative tagging community allows users to tag items via a
bookmarking tool that allows users to share and tag URLs. First,
web site. Users interact with the website by searching for items,
they observe a low correlation between the number of items in
adding new items to the community, or tagging existent items.
each user's bookmark list and the number of tags used by each
The tagging action performed by a user is generally referred as a
user. Next, they discuss the models that could explain this lack of
tag assignment.
correlation and suggest it is an effect of shared knowledge and
imitation in associating tags. Finally, the authors suggest that the For example, in CiteULike and Bibsonomy, each user has a
urn model proposed by Eggenberger & Polya [14] is an library, i.e., a set of links to scientific publications and books.
appropriate model to derive the evolution of tag usage frequency Each item in the library is associated with a set of terms (tags)
on a particular item. assigned by users. It is important to highlight that, in both
CiteULike and Bibsonomy, the process of assigning tags to items
The urn model can be formulated as follows: consider an urn that
is collaborative, in the sense that all users can inspect other users’
contains two colored balls. Iteratively, a ball is drawn at random
libraries and assigned tags. User can thus repeat tags used by
from the urn and returned to the urn together with a new ball of
others to mark a particular item. This is unlike other communities
the same color. If this process is repeated a number of times, the
(e.g., Flickr) where each user has a fine-grained access control to
fraction of balls of a particular color stabilizes. The interesting
define who has permissions to see the content and apply tags to it.
aspect of this model is that if the process is restarted, this fraction
converges to a different number. Golder and Huberman argue that In CiteULike and Bibsonomy users have two options to add items
this model captures the evolution of tag proportion observed in to their libraries:
the del.icio.us data set. In studies related to contextualized user
attention, this model may be valuable to predict future user 1. Browse the content of popular scientific literature portals
tagging assignments which can be a useful input to (e.g. ACM Portal, IEEE Explorer, arXiv.org), to add
recommendation mechanisms. Golder and Huberman’s study, publications to their own library, and
however, is limited in scale: their results on tagging behavior 2. Search for items present in other users' libraries and add
dynamics rely on only four days of tracked activity. them to their own library.
Other authors follow different approaches to investigate the While posting an item, a user can mark it with terms (i.e., tags)
characteristics of tagging systems. Schimtz [10][11] studies that can be used for future retrieval. The collaborative nature of
structural properties of del.icio.us and Bibsonomy, uses a tri- tagging relies on the fact that users potentially share interests and
partite hypergraph representation, and adapts the small-world use similar items and tags. Thus, while the tagging activity of one
pattern definitions to this representation. Cattuto et al. [12] model user may be self-centered the set of tags used may facilitate the
usage behavior via unipartite projections from a tripartite graph. job of other users in finding content of interest.
Our approach differs from these studies in terms of scale and in
the use of dynamic metrics to define shared user interest: we We represent a collaborative tagging community by the tuple:
define metrics that scale as the community grows and/or user C=(U,I,T,A), where U represents the set of users, I is the set of
activity increases (Section 6). items, T is the tagging vocabulary, and A the set of tag
assignments.
By analyzing del.icio.us, Chi and Mytkoswicz [2] find that the
efficiency of social tagging decreases as the communities grow: The set of tag assignments is denoted by A = {(u, t, p) | u ∈ U, t ∈
that is, tags are becoming less and less descriptive and T, p∈ I}. From this definition of tag assignments, we can derive
consequently it becomes harder to find a particular item using the definition of an individual user, item and tag, as follows:
them. Simultaneously, it becomes harder to find tags that
efficiently mark an item for future retrieval. These results indicate
A user uk ∈ U is denoted by a pair uk = (Ik, Tk), where Ik is the approximately 14% of the original data set, while the users
set of items user k has ever tagged. Thus, an item p ∈ Ik if and removed from Bibsonomy are around 0.6% of the original data
set. Table 1 summarizes the characteristics of each data set after
only if ∃ (uk,t, p) ∈ A, for any t ∈ Tk. Similarly, Tk is the set of the data cleaning operation.
tags user uk applied before, where t ∈ Tk if and only if ∃ (uk, t,
p) ∈ A. 5. TAGGING ACTIVITY
To gain an understanding on the usage patterns in these two
An item pi ∈ I is denoted by pi = (Ui , Ti ), where Ui is the set of communities, we start by evaluating the activity levels along
users who tagged this item, and Ti is set of tags this item has several metrics: the number of items per user, number of tagging
received. assignments performed, and number of tags used. The question
answered in this section is the following:
A tag tj ∈ T is denoted by tj = (Uj , Ij ), where Uj is the set of
users who used the tag tj before, and Ik is the set of items Q1: How is the tagging activity distributed among users?
annotated with the tag tj.
We aim to quantify the volume of user interaction with the
system, either by adding new content to the community, or by
4. DATA SETS AND DATA CLEANING tagging an existing item. Intuitively, one would expect that a few
Both tagging communities we analyze: CiteULike [3] and users are very active while the majority rarely interacts with the
Bibsonomy [4], aim to improve user’s organization and community.
management of research publications. Both provide functionality
Determining how often users perform tag assignments is
to import and export citation records in formats like BibTeX, for
important to help designing systems that track user attention. For
example.
example, in a context where activity information is used to
The data sets analyzed in this article were provided by the recommend new items based on tag similarity, it can be necessary
administrators of the respective web sites. Thus, the data to compute the similarity level at the same rate as the rate with
represents a global snapshot of each system within the period which new information is added into the system. Figure 1 presents
determined by the timestamps in the traces we have obtained the user rank according to the number of tag assignments
(Table 1). It is important to point out that the Bibsonomy data set performed during the time frame of our data set. In the results that
has timestamps starting at 1995, which we considered a bug. follow, we present the data points observed together with a curve
Moreover, Bibsonomy has two separate datasets, scientific that provides a good model to the observed data (i.e. Hoerl
literature and URL bookmarks. We concentrated our analysis on function [21]). At the end of this section we comment more on the
the scientific literature part of the data. characteristics of this curve.
In the original CiteULike data set, the most popular tag is “bibtex-
import” while the second most popular tag is “no-tag”,
automatically assigned when a user does not assign any tag to a
new item. The popularity of these two tags indicates that a large
part of users use CiteULike as a tool to convert their list of
citations to BibTex format, and that users tend not to tag items at
the time they post a new item to their individual libraries. Clearly,
this is relevant information for system designers who might want
to invest effort in improving the features of most interest.
Figure 1: User rank based on the number of tag assignments.
Also, in CiteULike one user posted and tagged more than 3,000 Note the logarithmic scales on both axes.
items within approximately 5 minutes (according to the A second metric for tagging activity is the size of user libraries.
timestamps in the data set). Obviously, this behavior is due to an Figure 2 plots user library size for users ranked in decreasing
automatic mechanism. order according to the size of their libraries for CiteULike and
Table 1: Summary of cleaned data sets used in this study Bibsonomy, respectively. This shows the size of the set of items a
particular user pays attention to. The results confirm that the users
CiteULike Bibsonomy in these two systems are heterogeneous in terms of activity
Period 11/2004—04/2006 ??—12/2006 intensity, as it has already been indicated by the tag assignment
# Users (|U|) 5,954 656 activity.
# Items (|I|) 199,512 67,034
# Tags (|T|) 51,079 21,221
# Assignments (|A|) 451,980 257,261
Our objective is to concentrate only on those users who are using
the system interactively to bookmark and share articles.
Consequently, for the analysis that follows, we have the “robot”
user (i.e., a user with 3,000 items tagged within 5 minutes) and
users who used only the tags bibtex-import and/or no-tag. The
total number of users removed from CiteULike represents Figure 2: User rank based on the library size.
The correlation between a user’s library size and her vocabulary is collaborative tagging community, one may draw a comparison
important to understand whether the diversity of the vocabulary between the potential diversity found in the users' library
used by each user grows with the number of items in her personal regarding the number of items in it, and the bio-diversity
library. We observe that, in both communities the users’ library distribution across geographic regions.
and vocabulary sizes are strongly correlated for CiteULike
Although a Hoerl function is a good fit for the activity
(R2 = 0.98, n = 5954) and less strongly, but still positively
distributions, this does not directly imply that diversity of user
correlated, for Bibsonomy (R2 = 0.80, n = 654). Although such
libraries or vocabularies represents a phenomenon which is
correlation may seem intuitive, since users with a more diverse set
similar to those presented by studies on biodiversity.
of items would need more tags to describe them, this behavior is
Nevertheless, the Hoerl function does provide a good model for
different from that observed by Golder and Huberman in
collaborative tagging activity and it can be useful to study user
del.icio.us [5]. A possible explanation is that in del.icio.us user is
diversity in collaborative tagging systems in the future.
presented with tag suggestions based on past tagging activity
when adding a new bookmark. These suggestions may bias and To summarize: in the communities we study, the intensity of user
limit the size of a user vocabulary. However, further investigation activity is distributed over multiple orders of magnitude, it is well
is necessary to assess how a user vocabulary is affected by tagging modeled using the Hoerl function and, unlike in other
recommendation. communities, there is a strong correlation in activity in terms of
items set and vocabulary sizes.
6. EVALUATING USER SIMILARITY
While the analysis above is important for an overall usage profile
evaluation of each community, it provides little information about
user interests. Assessing the commonality in user interests is
important for identifying user groups that may form around
content of common interest. Thus, a natural set of questions that
Figure 3: User rank by vocabulary size we aim to answer in this section are:
A second finding is that the tagging activity (i.e., number of Q2: Is the tagging community segmented into several sub-
tagging assignments) and library size per user are strongly communities with different interests? Do users cluster around
correlated for both communities (with R2 above 0.97) while the particular items and tags?
correlations between the tagging activity and the vocabulary size To address these questions, we define the interest-sharing graph
is strong for CiteULike (R2 = 0.99), but weaker for Bibsonomy after the intuition of data-sharing graphs introduced by Iamnitchi
(R2 = 0.67). et al. [27]. An interest-sharing graph captures the commonality in
A third finding is that tagging activity distributions are not well user interest for an entire user population: Intuitively, users are
modeled by a Zipf-like distribution. Instead, a Hoerl model [21] connected in the interest-sharing graph if they focus on the same
that extends the power-law family and it is defined by Equation subset of items and/or speak similar language (i.e., share a subset
(1) fits better: of tags).
f(x) = abxxc (1) More formally, consider a graph G = (U, E) where nodes are users
and edges represent the existence of shared interests or activity
Table 2 contains the Hoerl parameters a, b, and c determined via a similarity between users. The rest of this study explores three
curve fitting process for each of the ranking distributions possible definitions for user interest or activity similarity. All
observed. these definitions employ a threshold t for the percentage of items
Table 2: Coefficients determined for the Hoerl function or tags shared between two users:
CiteULike a b c 1) The User-Item similarity definition considers two users’
interests similar if the ratio between the sizes of the
Tag Assignments 9,767.13 0.9979 -0.4754
intersection and the size of the union of their item libraries is
Library Size 2,609.77 0.9988 -0.4772 larger than a threshold t. This is expressed by Equation 2.
Vocabulary Size 3,338.55 0.9992 -0.5964
Bibsonomy
Ik ∩ I j
ekj ∈ E ⇔ User-Item (2)
Tag Assignments 28,969.29 0.9864 -0.6888 Ik ∪ I j
Library Size 6,137.49 0.9850 -0.5461
2) The User-Tag definition is similar to the definition above but
Vocabulary Size 2,608.45 0.9907 -0.5126 considers the vocabularies of the two users rather than their
libraries.
Similar to the Zipf distribution, the Hoerl function has been used
to model a large number of natural phenomena. The most relevant
to collaborative tagging is the use of Hoerl function to describe
Tk ∩ T j
the distribution of bio-diversity across a geographic region
ekj ∈ E ⇔ User-Tag (3)
[22][24]. Considering each user's library a region in a
Tk ∪ T j
3) Unlike the User-Item definition in Equation 2 above, the increases (Note that we exclude isolated nodes from this count of
Directed User-Item considers two users’ interests similar if connected graph components).
the ratio between the intersection of their item libraries and
The plots in Figure 4 show that the number of connected
the size of one user library is larger than a threshold t. The
components increases up to a certain value of our similarity
idea is to explore the role played by users with large libraries
via the introduction of direction to the edges in the graph. threshold. After a certain value of t, the number of connected
components in the graph starts decreasing, since more and more
connected components will contain only one node and will thus
Ik ∩ I j be excluded. The critical threshold value is different for each user
ekj ∈ E ⇔ Directed-User-Item (4)
similarity definition.
Ik
In our analysis of real tag assignment traces from the two tagging
communities, even with low values for the sharing ratio threshold
t, the final graph contains a large number of isolated nodes.
Indeed, by setting the threshold as low as one single item (i.e.,
two users are connected if they share at least one item); we find
that, in CiteULike, 2,672 users (44.87%) are not connected to any
other user. This suggests that a large population of users has
individual preferences.
Figure 5: Total number of nodes in the interest sharing graph
and in the largest component for CiteULike (top) and
Bibsonomy (bottom)
The initial increase in the number of connected components can
be explained by the fact that, as the threshold increases, large
components split to form new islands. Since these islands form
naturally based on user similarity this result is encouraging since
Figure 4: Number of connected components for CiteULike it offers the potential to cluster users according to their interests.
(top) and Bibsonomy (bottom) As t continues to increase the definition of similarity becomes too
strict and leads to more and more isolated nodes.
Figure 4 presents, for the three similarity metrics defined above,
Two observations about the results in Figure 4 can be noted: first,
the number of connected components for both CiteULike and
as the threshold increases, the number of components decreases
Bibsonomy, for thresholds t varying from 1% to 99%. These
faster for the User-Item graph (Equation 2) than for the Directed-
results show that regardless of the graph definition the number of
User-Item graph (Equation 4). This illustrates the effect of using
connected components follow a similar trend as the threshold
an asymmetrical definition for shared interests. The idea explored
here is similar to the friendship graph, where connections distribution gets closer to a uniform random distribution) and,
between two nodes are not reciprocal (e.g., A might consider B to second, as the set becomes larger.
be his friend, but, at the same time, B is indifferent to A) [23].
Similarly, in the directed User-Item graph the size of the user data
set is considered leading to an asymmetrical definition of user
interest (e.g., if all items of A data set are included in B’s, then A
will consider she has shared interests with B, while, depending of
the overall size of his item set B might not consider his interests
are similar enough to be connected to A’s).
Second, there are more isolated nodes (i.e., zero-degree nodes) in
the User-Item and Directed-User-Item graphs than in the User-Tag
graph (defined by Equation 3). This indicates that users tend to
have larger overlaps among their vocabularies than among their
libraries. One reason for this observed interest sharing pattern may
be the fact that users browse and consume items from other users'
libraries without necessarily adding those items to their personal
libraries in the system. An approach to verify this observation is to
perform an analysis of user browsing histories to determine how Figure 6: Entropy growth considering items in CiteULike
often users download items from others’ libraries without adding In practical terms, in a collaborative tagging community, the
them to their personal set of items. From a system design increase in entropy of an item set means that the user needs to
perspective, knowing that users are more likely to share tags than filter out more items to find the one she is interested in. Similarly,
data items may be useful for designing item recommendation high entropy makes it harder to find a tag that describes an item
heuristics based on vocabulary overlap. well. Conversely, lower entropy makes it potentially easier for a
user to reach an item of interest. Thus, the question to be
All the similarity definitions above generally divide the original
answered in this section is the following:
graph into one giant component, several tiny components, and a
large number of isolated nodes. Figure 5 presents the total number Q3: Can the interest-sharing graph be used to reduce the entropy
of nodes in the components with at least two nodes and the perceived by users when navigating through the system?
number of nodes in the largest connected component for
thresholds varying from 1% to 99% for the three similarity Our two-part answer is briefly presented below and detailed in the
measures defined above. rest of this section. First, we demonstrate that the interest-sharing
graph can be used to reduce the entropy perceived by users. To
The results presented in this section demonstrate that using a this end we define a user’s neighborhood as its set of neighbors in
similarity metric and the resulting interest-sharing graph it is the sharing graph and show that this construction can be used to
possible to segment the user population according to manifested present users with an item set with low entropy.
interest. Based on this intuition, we conjecture that it is possible
to build tag/item recommendation mechanisms that exploit usage Second, we offer preliminary results that suggest that this
patterns, i.e., the shared interests among users. The next section segmentation of the user population based on neighborhoods in
offers a preliminary analysis of this hypothesis. the interest-sharing graph has a good predictive power: the items
consumed by a user’s neighbors predict well the future
consumption pattern of that user. Thus, this offers a path to build
7. IMPROVING NAVIGABILITY recommendations systems based on the interest-sharing graph.
Chi and Mytcowicz [2] report that navigability, defined as users’
ability to find relevant content, decreases as a tagging community The rest of this section presents the above two-step exploration in
grows. More precisely, Chi and Mytcowicz imply that the detail. We first define a user’s ‘neighborhood entropy’ as the
decrease in navigability is due to an increase in diversity in the set entropy of the union of the item sets of that user’s neighbors in
of items, users, and tags. the interest-sharing graph. To demonstrate that our group
selection technique is effective in reducing entropy, Figure 7
We have verified that the diversity of the data present in the two compares the average neighborhood entropy in the interest-
communities we study grows over time. Figure 6 presents the sharing graph with two types of other constructions. All results
evolution of entropy, as the metric to evaluate item diversity, for are reported with 95% confidence intervals.
the CiteULike data since November 2004.
Firstly, we compare for CiteULike, the average neighborhood
As a reminder, we note that the entropy of a set of items I, where entropy in the largest connected component of an interest sharing
|I| = N, is defined as: graph (Average Entropy – Figure 7), which is built by using the
N real tagging activity trace, to the total entropy in the system
H ( I ) = −∑ pi ⋅ log( pi ) (5) (almost double at 11.6 as presented in Figure 6). Similarly, the
i =1 neighborhood entropy in the largest connected component is
where pi is the popularity of an item i. compared to the total entropy in the same component (Largest
Component – Figure 7). Secondly, we compare with two random
The entropy of a set may increase in two circumstances: First, as graph constructions: the entropy of a random graph, which is built
the randomness in the item set increases (i.e., the popularity by selecting random neighbors from the set of users in the largest
connected component (Random Component – Figure 7); the 8. CONCLUSIONS & FUTURE WORK
second random graph is built by selecting random from the entire This work presents a characterization of two collaborative tagging
user set (Random Graph - Figure 7). systems, CiteULike and Bibsonomy, as a first step to help extract
implicit information about user attention in collaborative tagging
systems.
First, we analyze the distribution of tagging activity, i.e., the
distribution of the volume of items, tags, and tagging actions
related to each user’ activity in the tagging community. We find
that the activity distribution is highly heterogeneous along all
these multiple axes: a few active users contribute with a large
number of tag assignments and maintain a large number of items
and tags, while the majority of users have a modest tagging
activity.
Additionally, users with large libraries tend to have a large
vocabulary of tags. While this may seem intuitive, this is not the
norm across all tagging systems. In del.icio.us, for example user’s
library size and number of tags used are uncorrelated.
Second, we define the interest-sharing graph and investigate
several definitions for interest similarity based on user activity in
Figure 7: The neighborhood entropy for several sharing ratio terms of items and vocabularies employed. Our main findings can
thresholds in CiteULike using User-Item similarity metric be summarized as follows:
(Equation 2). 1. Both communities present a large population of isolated
We observe that the average ‘neighborhood entropy’ in the users (zero-degree nodes in the interest-sharing graph). This
interest sharing graph is lower for almost all thresholds analyzed: indicates that there are a large number of users with unique
First, the ‘neighborhood entropy’ in the interest sharing graph is preferences. On the other hand, by introducing direction in
between one half and one tenth of the entropy of the entire the graph of shared interests, it is possible to reduce the
dataset. Additionally, for all thresholds analyzed except for the number of isolated nodes. The final main directed connected
1% threshold that does not offer enough discrimination the component contains approximately twice more nodes than
‘neighborhood entropy’ is significantly (24% to 400%) lower than the undirected one.
that of similar random constructions.
2. The structural analysis reveals the existence of a significant
To support our hypothesis that the interest-sharing graph is a good number of small sub-communities of interests totally
basis to develop recommendation systems, we analyze how separated from each other.
efficient the neighbor’s item set in predicting future user attention
over items. To this end, we evaluate the hit ratio: the proportion 3. The structure of the interest-sharing graph can be used to
of items a user adds to her library at time T+1 that are already in reduce the diversity of the items a user is exposed to. To
her neighbor’ libraries at time T. quantify this reduction we compare the entropy of the
‘neighborhood item set’ with that of the item set of the entire
To evaluate the hit ratio, we considered the interest-sharing graph CiteULike/Bibsonomy item set, that of the main connected
based on the User-Item similarity metric with 1% sharing ratio component, and that of various constructions of random item
threshold. Preliminary results show that depending on the sets of similar sizes.
granularity considered (that is the length of our forecasting
period: interval between T and T+1) the hit rate is as high as 20% 4. Finally, we provide preliminary evidence that suggests that
for one hour granularity and decays to a low of 5% for a user’s activity can be predicted by considering the union of
one-month forecast granularity. This indicates that a user’s the item sets of a node’s neighbors in the interest sharing
neighborhood is a possible source of information to predict near graph. We conjecture that this property can be used to build
future user attention and its predictive effectiveness decreases for efficient, online recommendation systems for tagging
longer time intervals. communities.
The findings presented on this section can be summarized as This work inspires a fresh set of questions on collaborative
follows: First, we verify that the entropy increases as the tagging communities and contextualized attention.
CiteULike community grows overtime; Second, we verify that the A possible approach to implicitly extract user attention is to
interest-sharing graph can be used to reduce the entropy of item answer the following question: What are the patterns of
sets presented to users; and finally, we find that a user’s information propagation in collaborative tagging systems? An
‘neighborhood’ in the interest-sharing graph is a possible source answer to this question could build on previous work on
of information to predict future user actions. While the information diffusion in the blogspace [7] by exploring the
preliminary data we present here is encouraging we continue tour evolution of user attention overtime.
exploration for a complete analysis for diverse interest-sharing
graphs based on multiple similarity metrics and thresholds. A second intriguing issue to explore is the following How
malicious behavior affects a tagging system and whether it be
automatically detected? Search results that are manipulated by [12] Cattuto, C., Loreto, V. and Pietronero, L. "Semiotic
tagging misbehavior can have an impact on usage in a dynamics and collaborative tagging". In PNAS, vol. 104,
collaborative tagging community [13]. Automatic detection of no. 5, 1461–1464, January, 2007.
malicious users is paramount to the long term survival of these [13] Koutrika, G., Effendi, F., Gyongyi, Z., Heymann, P., Garcia-
communities. Molina, H. “Combating Tag Spam”. In Proceedings of
Finally, exploring the structure of user interest overtime can help AirWeb Workshop, 2007.
devising models for the formation and evolution of the user [14] Eggenberger, F. and Polya, G. (1923). “Uber die Statistik
similarity graphs, similarly to the study by Kumar et al. [16] on verketteter Vorgange”. Zeitschrift fur Angewandte
online social networks. Mathemathik und Mechanik, 3, 279—289.
[15] Pitkow, J., Schutze, H., Todd, C., Cooley, R., Turnbull, D.,
ACKNOWLEDGMENTS Edmonds, A., Adar, E., Breuel, T. “Personalized search”.
The authors would like to thank Richard Cameron for providing Communications of the ACM (The consumer side of search),
the CiteULike data set; Christoph Schmitz for providing the Volume 45 (9), 50–55, 2002.
Bibsonomy data set; professor Lee Iverson for insightful
[16] Novak, J., Kumar, R. and Tomkins, A. “Structure and
discussions on early stages of this work, and Armin
Evolution of On Line Social Networks”. In Proceedings of
Bahramshahry, Samer Al Kiswany and Nazareno Andrade for
12th ACM SIGKDD, poster, 2006.
their valuable comments. The graph analysis was executed in
parallel using OurGrid (http://www.ourgrid.org). [17] Niwa, S., Doi, T. and Honiden, S. “Web Page Recommender
System based on Folksonomy Mining”. In Proceedings of the
3rd International Conference on Information Technology:
REFERENCES New Generations, 2006.
[1] Lawrence, S. “Context in Web Search”, IEEE Data
Engineering Bulletin, Volume 23, Number 3, pp. 25–32, [18] Li, J. and Zaiane, O. “Combining Usage, Content and
2000. Structure Data to Improve Web Site Recommendation”. In
Proceedings of WebKDD-2004 Workshop on Web Mining
[2] Chi, E. H. and Mytkowicz, T. “Understanding Navigability and Web Usage, 2004.
of Social Tagging Systems”. In Proceedings of CHI'07,
February, 2007. [19] Kazienko, P. and Kiwera, M. “Integration of Relational
Databases and Web Site Content for Product and Page
[3] CiteULike. http://www.citeulike.org, 2007. Recommendation”. In Proceedings of Database Engineering
[4] Bibsonomy. http://www.bibsonomy.org, 2007. and Applications Symposium, 2004.
[5] Golder, S. and Huberman, B. “The Structure of [20] Mathes, A. “Evaluating Collaborative filtering
Collaborative Tagging Systems”. Journal of Information Recommender Systems”. ACM Transactions on Information
Science, Vol. 32, No. 2, 198-208 (2006).. Systems, 2004.
[6] Najjar, J., Wolpers, M., and Duval, E. L. “Attention [21] Hoerl, A. E., “Fitting Curves to Data”. In Chemical
Metadata: Collection and Management”. In Proceedings of Business Handbook, Perry, J. (ed), Section 20, pp. 20--55,
Workshop on Logging Traces of Web Activity: The 1954.
Mechanics of Data Collection, Edinburgh, Scotland, May 23, [22] Lambshead, P. J. D., Brown, C. J., Ferrero, T. J, Hawkins, L.
2006. E., Smith, C. R and Mitchell, N. J. “Biodiversity of
[7] Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A. nematode assemblages from region of the Clarion-
“Information Diffusion through Blogspace”. Proceedings of Clipperton Fracture Zone”. In BioMed Central Ecology,
the 13th International Conference on World Wide Web, 3:1, 2003.
Pages: 491 – 501, 2004. [23] Doreian, P., Batagelj, V. and Ferligoj, A. “Generalized
[8] Zamir, O. E., Korn, J. L., Fikes, A. B., Lawrence, S. R. Blockmodeling”, Cambridge University Press, 2005.
“Personalization of placed content ordering in search [24] Strong, Jr., D. R., McCoy, E. D. and Rey, J. R. “Time and
results”. US-Patent Application 20050240580. the Number of Herbivore Species: The Pests of Sugarcane”.
[9] Haveliwala, T., Jeh, G. M., Kamvar, S. D. “Results based In Ecology, Vol. 58, No. 1. (Jan., 1977), pp. 167-175.
personalization of advertisements in a search engine”. US [25] Flickr. http://www.flickr.com, 2007.
Patent Application 20050222989.
[26] Del.icio.us Blog. http://blog.del.icio.us. September, 2006.
[10] Schimitz, C. "Small World Folksonomies: Clustering in Tri-
Partite Hypergraphs". Univ. Kassel Tech Report, Nov/06. [27] Iamnitchi, A., Ripeanu, M. and Foster, I..“Small-World File-
Sharing Communities”, In INFOCOM, 2004
[11] Watts D. J, Strogatz S. H. "Collective Dynamics of 'small-
world' networks". In Nature, vol. 393, June, 1998.