Tracking Usage in Collaborative Tagging Communities Elizeu Santos-Neto Matei Ripeanu Adriana Iamnitchi Electrical & Computer Engineering Electrical & Computer Engineering Computer Science and Engineering University of British Columbia University of British Columbia University of South Florida 2332 Mail Mall – KAIS 4075 2356 Mail Mall – KAIS 4033 4202 E. Fowler Ave, Tampa, FL +1.604.827.4270 +1.604.822.7281 +1.813.974.5357 elizeus@ece.ubc.ca matei@ece.ubc.ca anda@cse.usf.edu ABSTRACT [16] report that in January 2006 Flickr congregated about one Collaborative tagging has recently attracted the attention of both million users. Similarly, del.icio.us reached one million users in industry and academia due to the popularity of content-sharing September 2006 [26]. systems such as CiteULike, del.icio.us, and Flickr. These systems Although collaborative tagging is attracting increasing attention give users the opportunity to add data items and to attach their from both industry and academia, there are few studies that assess own metadata (i.e., tags) to stored data. The result is an effective the characteristics of communities of users who share and tag content management tool for individual users. Recent studies, content. In particular, little research has been done on the however, suggest that, as tagging communities grow, the added potential benefits of tracking usage patterns in collaborative content and the metadata become harder to manage due to tagging communities. Moreover, recent investigations have shown increased content diversity. Thus, mechanisms that cope with that, as the user population grows, the efficiency of information increase of diversity are fundamental to improve the scalability retrieval based on user generated tags tends to decrease [2]. and usability of collaborative tagging systems. Mining usage patterns is an efficient method to improve the This paper analyzes whether usage patterns can be harnessed to quality of service provided by information retrieval mechanisms improve navigability in a growing knowledge space. To this end, in the web context. For example, usage patterns can be harnessed it presents a characterization of two collaborative tagging to improve ‘browsing experience’ via recommendation systems or communities that target the management of scientific literature: to predict buying patterns and consequently increase revenue of CiteULike and Bibsonomy. We explore three main directions: e-commerce operations [8][9][10][17][18][19][20]. First, we analyze the tagging activity distribution across the user population. Second, we define new metrics for similarity in user This work is motivated by the following conjecture: usage interest and use these metrics to uncover the structure of the patterns can be harnessed to present relevant, contextualized tagging communities we study. The properties of the structure we information and deal with the reduced navigability generated by uncover suggest a clear segmentation of interests into a large informational overload in large tagging communities. number of individuals with unique preferences and a core set of We present encouraging preliminary steps to substantiate the users with interspersed interests. Finally, we offer preliminary above conjecture: We characterize two collaborative tagging results that suggest that the interest-based structure of the tagging systems: (CiteULike [3] and Bibsonomy [4]) as a first step towards community can be used to facilitate content retrieval and a model to represent user interests based on tagging activity. After navigation as communities scale. introducing related work (Section 2), we present a formal definition for tagging communities (Section 3) and the Categories and Subject Descriptors communities and the data sets this study explores (Section 4). We H.1.1 [General]: Systems and Information Theory - Information then characterize tagging activity distribution among users Theory. H.3.5 [Information Storage and Retrieval]: On-line (Section 5) and we investigate the structure of user’s shared Information Services - Web-based services. interests (Section 6). Finally, we present preliminary results on the efficacy of using contextualized attention based on the structure of shared interests to improve the navigability in the system (Section General Terms 7). Section 8 summarizes our findings and outlines future research Measurement, Experimentation, Human Factors, Theory. directions. Keywords 2. RELATED WORK Collaborative Tagging, Usage Patterns, Modeling User Attention, Two types of techniques, implicit and explicit, are traditionally CiteULike, Bibsonomy. used to elicit user preferences in the Web context [1][6][15]. Explicit techniques are based on direct input from a user with 1. INTRODUCTION respect to her preferences and interests (e.g., page rating scales, Collaborative tagging systems are online communities that allow item reviews, categories of interest). Implicit techniques infer a users to assign terms from an uncontrolled vocabulary (i.e., tags) definition of user interests from her activity, e.g., using client-side to items of interest. This simple tagging feature proves to be a or service-side mechanisms such as browser plug-ins, client powerful mechanism for personal knowledge management (e.g., extensions, and server-side logs to track usage patterns. Clearly, in systems like CiteULike [3]) and content sharing (e.g., in each technique has its own advantages and limitations in terms of communities such as Flickr [25]). Recently, collaborative tagging accuracy, cost to the user, privacy control, or ability to adapt to systems have attracted massive user communities: Novak et al. changes on user interests trends. In a tagging community context, the tags themselves can be that, to facilitate browsing through tagging systems, it is interpreted as explicit metadata added by each user. Additionally, increasingly important to take into account user attention in terms observed tagging activity including the volume and frequency of observed tagging activity. with which items are added, the number of tagged items, or tag Niwa et al. [17] propose a recommendation system based on the vocabulary size can be harnessed to extract implicit information. affinity between users and tags, and on the explicit site Due to the youth of collaborative tagging systems, relatively little preferences expressed by the user. Our study differs from this work has been done on tracking usage and exploring work as we use implicit user profiles and propose the use of contextualized user attention in these communities. However, entropy as a metric to characterize their effectiveness. several studies present techniques and models for collecting and Outside the academic area, a number of projects explore the use of managing user attention metadata in the wider web context implicitly-gathered user information. We mention Google's without exploring tagging features [1][6][15]. These techniques initiative to explore users’ past search history to refine the results include post processing of usage logs, tracking user input (e.g. provided by the Page Rank [8][9]. Commercial interest in search terms) and eliciting explicit user preferences. Other contextualized user attention highlights that tracking user investigations are concerned with methods to use contextualized attention and characterizing collective online behavior is not only attention to improve web search [1][15]. an intriguing research topic, but also a potentially attractive As a first step to modeling user attention in tagging communities, business opportunity. it is necessary to characterize collaborative tagging behavior. In this respect, Golder and Huberman [5] study user activity patterns 3. BACKGROUND regarding system utilization and tag usage in del.icio.us – a social A collaborative tagging community allows users to tag items via a bookmarking tool that allows users to share and tag URLs. First, web site. Users interact with the website by searching for items, they observe a low correlation between the number of items in adding new items to the community, or tagging existent items. each user's bookmark list and the number of tags used by each The tagging action performed by a user is generally referred as a user. Next, they discuss the models that could explain this lack of tag assignment. correlation and suggest it is an effect of shared knowledge and imitation in associating tags. Finally, the authors suggest that the For example, in CiteULike and Bibsonomy, each user has a urn model proposed by Eggenberger & Polya [14] is an library, i.e., a set of links to scientific publications and books. appropriate model to derive the evolution of tag usage frequency Each item in the library is associated with a set of terms (tags) on a particular item. assigned by users. It is important to highlight that, in both CiteULike and Bibsonomy, the process of assigning tags to items The urn model can be formulated as follows: consider an urn that is collaborative, in the sense that all users can inspect other users’ contains two colored balls. Iteratively, a ball is drawn at random libraries and assigned tags. User can thus repeat tags used by from the urn and returned to the urn together with a new ball of others to mark a particular item. This is unlike other communities the same color. If this process is repeated a number of times, the (e.g., Flickr) where each user has a fine-grained access control to fraction of balls of a particular color stabilizes. The interesting define who has permissions to see the content and apply tags to it. aspect of this model is that if the process is restarted, this fraction converges to a different number. Golder and Huberman argue that In CiteULike and Bibsonomy users have two options to add items this model captures the evolution of tag proportion observed in to their libraries: the del.icio.us data set. In studies related to contextualized user attention, this model may be valuable to predict future user 1. Browse the content of popular scientific literature portals tagging assignments which can be a useful input to (e.g. ACM Portal, IEEE Explorer, arXiv.org), to add recommendation mechanisms. Golder and Huberman’s study, publications to their own library, and however, is limited in scale: their results on tagging behavior 2. Search for items present in other users' libraries and add dynamics rely on only four days of tracked activity. them to their own library. Other authors follow different approaches to investigate the While posting an item, a user can mark it with terms (i.e., tags) characteristics of tagging systems. Schimtz [10][11] studies that can be used for future retrieval. The collaborative nature of structural properties of del.icio.us and Bibsonomy, uses a tri- tagging relies on the fact that users potentially share interests and partite hypergraph representation, and adapts the small-world use similar items and tags. Thus, while the tagging activity of one pattern definitions to this representation. Cattuto et al. [12] model user may be self-centered the set of tags used may facilitate the usage behavior via unipartite projections from a tripartite graph. job of other users in finding content of interest. Our approach differs from these studies in terms of scale and in the use of dynamic metrics to define shared user interest: we We represent a collaborative tagging community by the tuple: define metrics that scale as the community grows and/or user C=(U,I,T,A), where U represents the set of users, I is the set of activity increases (Section 6). items, T is the tagging vocabulary, and A the set of tag assignments. By analyzing del.icio.us, Chi and Mytkoswicz [2] find that the efficiency of social tagging decreases as the communities grow: The set of tag assignments is denoted by A = {(u, t, p) | u ∈ U, t ∈ that is, tags are becoming less and less descriptive and T, p∈ I}. From this definition of tag assignments, we can derive consequently it becomes harder to find a particular item using the definition of an individual user, item and tag, as follows: them. Simultaneously, it becomes harder to find tags that efficiently mark an item for future retrieval. These results indicate  A user uk ∈ U is denoted by a pair uk = (Ik, Tk), where Ik is the approximately 14% of the original data set, while the users set of items user k has ever tagged. Thus, an item p ∈ Ik if and removed from Bibsonomy are around 0.6% of the original data set. Table 1 summarizes the characteristics of each data set after only if ∃ (uk,t, p) ∈ A, for any t ∈ Tk. Similarly, Tk is the set of the data cleaning operation. tags user uk applied before, where t ∈ Tk if and only if ∃ (uk, t, p) ∈ A. 5. TAGGING ACTIVITY To gain an understanding on the usage patterns in these two  An item pi ∈ I is denoted by pi = (Ui , Ti ), where Ui is the set of communities, we start by evaluating the activity levels along users who tagged this item, and Ti is set of tags this item has several metrics: the number of items per user, number of tagging received. assignments performed, and number of tags used. The question answered in this section is the following:  A tag tj ∈ T is denoted by tj = (Uj , Ij ), where Uj is the set of users who used the tag tj before, and Ik is the set of items Q1: How is the tagging activity distributed among users? annotated with the tag tj. We aim to quantify the volume of user interaction with the system, either by adding new content to the community, or by 4. DATA SETS AND DATA CLEANING tagging an existing item. Intuitively, one would expect that a few Both tagging communities we analyze: CiteULike [3] and users are very active while the majority rarely interacts with the Bibsonomy [4], aim to improve user’s organization and community. management of research publications. Both provide functionality Determining how often users perform tag assignments is to import and export citation records in formats like BibTeX, for important to help designing systems that track user attention. For example. example, in a context where activity information is used to The data sets analyzed in this article were provided by the recommend new items based on tag similarity, it can be necessary administrators of the respective web sites. Thus, the data to compute the similarity level at the same rate as the rate with represents a global snapshot of each system within the period which new information is added into the system. Figure 1 presents determined by the timestamps in the traces we have obtained the user rank according to the number of tag assignments (Table 1). It is important to point out that the Bibsonomy data set performed during the time frame of our data set. In the results that has timestamps starting at 1995, which we considered a bug. follow, we present the data points observed together with a curve Moreover, Bibsonomy has two separate datasets, scientific that provides a good model to the observed data (i.e. Hoerl literature and URL bookmarks. We concentrated our analysis on function [21]). At the end of this section we comment more on the the scientific literature part of the data. characteristics of this curve. In the original CiteULike data set, the most popular tag is “bibtex- import” while the second most popular tag is “no-tag”, automatically assigned when a user does not assign any tag to a new item. The popularity of these two tags indicates that a large part of users use CiteULike as a tool to convert their list of citations to BibTex format, and that users tend not to tag items at the time they post a new item to their individual libraries. Clearly, this is relevant information for system designers who might want to invest effort in improving the features of most interest. Figure 1: User rank based on the number of tag assignments. Also, in CiteULike one user posted and tagged more than 3,000 Note the logarithmic scales on both axes. items within approximately 5 minutes (according to the A second metric for tagging activity is the size of user libraries. timestamps in the data set). Obviously, this behavior is due to an Figure 2 plots user library size for users ranked in decreasing automatic mechanism. order according to the size of their libraries for CiteULike and Table 1: Summary of cleaned data sets used in this study Bibsonomy, respectively. This shows the size of the set of items a particular user pays attention to. The results confirm that the users CiteULike Bibsonomy in these two systems are heterogeneous in terms of activity Period 11/2004—04/2006 ??—12/2006 intensity, as it has already been indicated by the tag assignment # Users (|U|) 5,954 656 activity. # Items (|I|) 199,512 67,034 # Tags (|T|) 51,079 21,221 # Assignments (|A|) 451,980 257,261 Our objective is to concentrate only on those users who are using the system interactively to bookmark and share articles. Consequently, for the analysis that follows, we have the “robot” user (i.e., a user with 3,000 items tagged within 5 minutes) and users who used only the tags bibtex-import and/or no-tag. The total number of users removed from CiteULike represents Figure 2: User rank based on the library size. The correlation between a user’s library size and her vocabulary is collaborative tagging community, one may draw a comparison important to understand whether the diversity of the vocabulary between the potential diversity found in the users' library used by each user grows with the number of items in her personal regarding the number of items in it, and the bio-diversity library. We observe that, in both communities the users’ library distribution across geographic regions. and vocabulary sizes are strongly correlated for CiteULike Although a Hoerl function is a good fit for the activity (R2 = 0.98, n = 5954) and less strongly, but still positively distributions, this does not directly imply that diversity of user correlated, for Bibsonomy (R2 = 0.80, n = 654). Although such libraries or vocabularies represents a phenomenon which is correlation may seem intuitive, since users with a more diverse set similar to those presented by studies on biodiversity. of items would need more tags to describe them, this behavior is Nevertheless, the Hoerl function does provide a good model for different from that observed by Golder and Huberman in collaborative tagging activity and it can be useful to study user del.icio.us [5]. A possible explanation is that in del.icio.us user is diversity in collaborative tagging systems in the future. presented with tag suggestions based on past tagging activity when adding a new bookmark. These suggestions may bias and To summarize: in the communities we study, the intensity of user limit the size of a user vocabulary. However, further investigation activity is distributed over multiple orders of magnitude, it is well is necessary to assess how a user vocabulary is affected by tagging modeled using the Hoerl function and, unlike in other recommendation. communities, there is a strong correlation in activity in terms of items set and vocabulary sizes. 6. EVALUATING USER SIMILARITY While the analysis above is important for an overall usage profile evaluation of each community, it provides little information about user interests. Assessing the commonality in user interests is important for identifying user groups that may form around content of common interest. Thus, a natural set of questions that Figure 3: User rank by vocabulary size we aim to answer in this section are: A second finding is that the tagging activity (i.e., number of Q2: Is the tagging community segmented into several sub- tagging assignments) and library size per user are strongly communities with different interests? Do users cluster around correlated for both communities (with R2 above 0.97) while the particular items and tags? correlations between the tagging activity and the vocabulary size To address these questions, we define the interest-sharing graph is strong for CiteULike (R2 = 0.99), but weaker for Bibsonomy after the intuition of data-sharing graphs introduced by Iamnitchi (R2 = 0.67). et al. [27]. An interest-sharing graph captures the commonality in A third finding is that tagging activity distributions are not well user interest for an entire user population: Intuitively, users are modeled by a Zipf-like distribution. Instead, a Hoerl model [21] connected in the interest-sharing graph if they focus on the same that extends the power-law family and it is defined by Equation subset of items and/or speak similar language (i.e., share a subset (1) fits better: of tags). f(x) = abxxc (1) More formally, consider a graph G = (U, E) where nodes are users and edges represent the existence of shared interests or activity Table 2 contains the Hoerl parameters a, b, and c determined via a similarity between users. The rest of this study explores three curve fitting process for each of the ranking distributions possible definitions for user interest or activity similarity. All observed. these definitions employ a threshold t for the percentage of items Table 2: Coefficients determined for the Hoerl function or tags shared between two users: CiteULike a b c 1) The User-Item similarity definition considers two users’ interests similar if the ratio between the sizes of the Tag Assignments 9,767.13 0.9979 -0.4754 intersection and the size of the union of their item libraries is Library Size 2,609.77 0.9988 -0.4772 larger than a threshold t. This is expressed by Equation 2. Vocabulary Size 3,338.55 0.9992 -0.5964 Bibsonomy Ik ∩ I j ekj ∈ E ⇔ User-Item (2) Tag Assignments 28,969.29 0.9864 -0.6888 Ik ∪ I j Library Size 6,137.49 0.9850 -0.5461 2) The User-Tag definition is similar to the definition above but Vocabulary Size 2,608.45 0.9907 -0.5126 considers the vocabularies of the two users rather than their libraries. Similar to the Zipf distribution, the Hoerl function has been used to model a large number of natural phenomena. The most relevant to collaborative tagging is the use of Hoerl function to describe Tk ∩ T j the distribution of bio-diversity across a geographic region ekj ∈ E ⇔ User-Tag (3) [22][24]. Considering each user's library a region in a Tk ∪ T j 3) Unlike the User-Item definition in Equation 2 above, the increases (Note that we exclude isolated nodes from this count of Directed User-Item considers two users’ interests similar if connected graph components). the ratio between the intersection of their item libraries and The plots in Figure 4 show that the number of connected the size of one user library is larger than a threshold t. The components increases up to a certain value of our similarity idea is to explore the role played by users with large libraries via the introduction of direction to the edges in the graph. threshold. After a certain value of t, the number of connected components in the graph starts decreasing, since more and more connected components will contain only one node and will thus Ik ∩ I j be excluded. The critical threshold value is different for each user ekj ∈ E ⇔ Directed-User-Item (4) similarity definition. Ik In our analysis of real tag assignment traces from the two tagging communities, even with low values for the sharing ratio threshold t, the final graph contains a large number of isolated nodes. Indeed, by setting the threshold as low as one single item (i.e., two users are connected if they share at least one item); we find that, in CiteULike, 2,672 users (44.87%) are not connected to any other user. This suggests that a large population of users has individual preferences. Figure 5: Total number of nodes in the interest sharing graph and in the largest component for CiteULike (top) and Bibsonomy (bottom) The initial increase in the number of connected components can be explained by the fact that, as the threshold increases, large components split to form new islands. Since these islands form naturally based on user similarity this result is encouraging since Figure 4: Number of connected components for CiteULike it offers the potential to cluster users according to their interests. (top) and Bibsonomy (bottom) As t continues to increase the definition of similarity becomes too strict and leads to more and more isolated nodes. Figure 4 presents, for the three similarity metrics defined above, Two observations about the results in Figure 4 can be noted: first, the number of connected components for both CiteULike and as the threshold increases, the number of components decreases Bibsonomy, for thresholds t varying from 1% to 99%. These faster for the User-Item graph (Equation 2) than for the Directed- results show that regardless of the graph definition the number of User-Item graph (Equation 4). This illustrates the effect of using connected components follow a similar trend as the threshold an asymmetrical definition for shared interests. The idea explored here is similar to the friendship graph, where connections distribution gets closer to a uniform random distribution) and, between two nodes are not reciprocal (e.g., A might consider B to second, as the set becomes larger. be his friend, but, at the same time, B is indifferent to A) [23]. Similarly, in the directed User-Item graph the size of the user data set is considered leading to an asymmetrical definition of user interest (e.g., if all items of A data set are included in B’s, then A will consider she has shared interests with B, while, depending of the overall size of his item set B might not consider his interests are similar enough to be connected to A’s). Second, there are more isolated nodes (i.e., zero-degree nodes) in the User-Item and Directed-User-Item graphs than in the User-Tag graph (defined by Equation 3). This indicates that users tend to have larger overlaps among their vocabularies than among their libraries. One reason for this observed interest sharing pattern may be the fact that users browse and consume items from other users' libraries without necessarily adding those items to their personal libraries in the system. An approach to verify this observation is to perform an analysis of user browsing histories to determine how Figure 6: Entropy growth considering items in CiteULike often users download items from others’ libraries without adding In practical terms, in a collaborative tagging community, the them to their personal set of items. From a system design increase in entropy of an item set means that the user needs to perspective, knowing that users are more likely to share tags than filter out more items to find the one she is interested in. Similarly, data items may be useful for designing item recommendation high entropy makes it harder to find a tag that describes an item heuristics based on vocabulary overlap. well. Conversely, lower entropy makes it potentially easier for a user to reach an item of interest. Thus, the question to be All the similarity definitions above generally divide the original answered in this section is the following: graph into one giant component, several tiny components, and a large number of isolated nodes. Figure 5 presents the total number Q3: Can the interest-sharing graph be used to reduce the entropy of nodes in the components with at least two nodes and the perceived by users when navigating through the system? number of nodes in the largest connected component for thresholds varying from 1% to 99% for the three similarity Our two-part answer is briefly presented below and detailed in the measures defined above. rest of this section. First, we demonstrate that the interest-sharing graph can be used to reduce the entropy perceived by users. To The results presented in this section demonstrate that using a this end we define a user’s neighborhood as its set of neighbors in similarity metric and the resulting interest-sharing graph it is the sharing graph and show that this construction can be used to possible to segment the user population according to manifested present users with an item set with low entropy. interest. Based on this intuition, we conjecture that it is possible to build tag/item recommendation mechanisms that exploit usage Second, we offer preliminary results that suggest that this patterns, i.e., the shared interests among users. The next section segmentation of the user population based on neighborhoods in offers a preliminary analysis of this hypothesis. the interest-sharing graph has a good predictive power: the items consumed by a user’s neighbors predict well the future consumption pattern of that user. Thus, this offers a path to build 7. IMPROVING NAVIGABILITY recommendations systems based on the interest-sharing graph. Chi and Mytcowicz [2] report that navigability, defined as users’ ability to find relevant content, decreases as a tagging community The rest of this section presents the above two-step exploration in grows. More precisely, Chi and Mytcowicz imply that the detail. We first define a user’s ‘neighborhood entropy’ as the decrease in navigability is due to an increase in diversity in the set entropy of the union of the item sets of that user’s neighbors in of items, users, and tags. the interest-sharing graph. To demonstrate that our group selection technique is effective in reducing entropy, Figure 7 We have verified that the diversity of the data present in the two compares the average neighborhood entropy in the interest- communities we study grows over time. Figure 6 presents the sharing graph with two types of other constructions. All results evolution of entropy, as the metric to evaluate item diversity, for are reported with 95% confidence intervals. the CiteULike data since November 2004. Firstly, we compare for CiteULike, the average neighborhood As a reminder, we note that the entropy of a set of items I, where entropy in the largest connected component of an interest sharing |I| = N, is defined as: graph (Average Entropy – Figure 7), which is built by using the N real tagging activity trace, to the total entropy in the system H ( I ) = −∑ pi ⋅ log( pi ) (5) (almost double at 11.6 as presented in Figure 6). Similarly, the i =1 neighborhood entropy in the largest connected component is where pi is the popularity of an item i. compared to the total entropy in the same component (Largest Component – Figure 7). Secondly, we compare with two random The entropy of a set may increase in two circumstances: First, as graph constructions: the entropy of a random graph, which is built the randomness in the item set increases (i.e., the popularity by selecting random neighbors from the set of users in the largest connected component (Random Component – Figure 7); the 8. CONCLUSIONS & FUTURE WORK second random graph is built by selecting random from the entire This work presents a characterization of two collaborative tagging user set (Random Graph - Figure 7). systems, CiteULike and Bibsonomy, as a first step to help extract implicit information about user attention in collaborative tagging systems. First, we analyze the distribution of tagging activity, i.e., the distribution of the volume of items, tags, and tagging actions related to each user’ activity in the tagging community. We find that the activity distribution is highly heterogeneous along all these multiple axes: a few active users contribute with a large number of tag assignments and maintain a large number of items and tags, while the majority of users have a modest tagging activity. Additionally, users with large libraries tend to have a large vocabulary of tags. While this may seem intuitive, this is not the norm across all tagging systems. In del.icio.us, for example user’s library size and number of tags used are uncorrelated. Second, we define the interest-sharing graph and investigate several definitions for interest similarity based on user activity in Figure 7: The neighborhood entropy for several sharing ratio terms of items and vocabularies employed. Our main findings can thresholds in CiteULike using User-Item similarity metric be summarized as follows: (Equation 2). 1. Both communities present a large population of isolated We observe that the average ‘neighborhood entropy’ in the users (zero-degree nodes in the interest-sharing graph). This interest sharing graph is lower for almost all thresholds analyzed: indicates that there are a large number of users with unique First, the ‘neighborhood entropy’ in the interest sharing graph is preferences. On the other hand, by introducing direction in between one half and one tenth of the entropy of the entire the graph of shared interests, it is possible to reduce the dataset. Additionally, for all thresholds analyzed except for the number of isolated nodes. The final main directed connected 1% threshold that does not offer enough discrimination the component contains approximately twice more nodes than ‘neighborhood entropy’ is significantly (24% to 400%) lower than the undirected one. that of similar random constructions. 2. The structural analysis reveals the existence of a significant To support our hypothesis that the interest-sharing graph is a good number of small sub-communities of interests totally basis to develop recommendation systems, we analyze how separated from each other. efficient the neighbor’s item set in predicting future user attention over items. To this end, we evaluate the hit ratio: the proportion 3. The structure of the interest-sharing graph can be used to of items a user adds to her library at time T+1 that are already in reduce the diversity of the items a user is exposed to. To her neighbor’ libraries at time T. quantify this reduction we compare the entropy of the ‘neighborhood item set’ with that of the item set of the entire To evaluate the hit ratio, we considered the interest-sharing graph CiteULike/Bibsonomy item set, that of the main connected based on the User-Item similarity metric with 1% sharing ratio component, and that of various constructions of random item threshold. Preliminary results show that depending on the sets of similar sizes. granularity considered (that is the length of our forecasting period: interval between T and T+1) the hit rate is as high as 20% 4. Finally, we provide preliminary evidence that suggests that for one hour granularity and decays to a low of 5% for a user’s activity can be predicted by considering the union of one-month forecast granularity. This indicates that a user’s the item sets of a node’s neighbors in the interest sharing neighborhood is a possible source of information to predict near graph. We conjecture that this property can be used to build future user attention and its predictive effectiveness decreases for efficient, online recommendation systems for tagging longer time intervals. communities. The findings presented on this section can be summarized as This work inspires a fresh set of questions on collaborative follows: First, we verify that the entropy increases as the tagging communities and contextualized attention. CiteULike community grows overtime; Second, we verify that the A possible approach to implicitly extract user attention is to interest-sharing graph can be used to reduce the entropy of item answer the following question: What are the patterns of sets presented to users; and finally, we find that a user’s information propagation in collaborative tagging systems? An ‘neighborhood’ in the interest-sharing graph is a possible source answer to this question could build on previous work on of information to predict future user actions. While the information diffusion in the blogspace [7] by exploring the preliminary data we present here is encouraging we continue tour evolution of user attention overtime. exploration for a complete analysis for diverse interest-sharing graphs based on multiple similarity metrics and thresholds. A second intriguing issue to explore is the following How malicious behavior affects a tagging system and whether it be automatically detected? Search results that are manipulated by [12] Cattuto, C., Loreto, V. and Pietronero, L. "Semiotic tagging misbehavior can have an impact on usage in a dynamics and collaborative tagging". In PNAS, vol. 104, collaborative tagging community [13]. Automatic detection of no. 5, 1461–1464, January, 2007. malicious users is paramount to the long term survival of these [13] Koutrika, G., Effendi, F., Gyongyi, Z., Heymann, P., Garcia- communities. Molina, H. “Combating Tag Spam”. In Proceedings of Finally, exploring the structure of user interest overtime can help AirWeb Workshop, 2007. devising models for the formation and evolution of the user [14] Eggenberger, F. and Polya, G. (1923). “Uber die Statistik similarity graphs, similarly to the study by Kumar et al. [16] on verketteter Vorgange”. Zeitschrift fur Angewandte online social networks. Mathemathik und Mechanik, 3, 279—289. [15] Pitkow, J., Schutze, H., Todd, C., Cooley, R., Turnbull, D., ACKNOWLEDGMENTS Edmonds, A., Adar, E., Breuel, T. “Personalized search”. The authors would like to thank Richard Cameron for providing Communications of the ACM (The consumer side of search), the CiteULike data set; Christoph Schmitz for providing the Volume 45 (9), 50–55, 2002. Bibsonomy data set; professor Lee Iverson for insightful [16] Novak, J., Kumar, R. and Tomkins, A. “Structure and discussions on early stages of this work, and Armin Evolution of On Line Social Networks”. In Proceedings of Bahramshahry, Samer Al Kiswany and Nazareno Andrade for 12th ACM SIGKDD, poster, 2006. their valuable comments. The graph analysis was executed in parallel using OurGrid (http://www.ourgrid.org). [17] Niwa, S., Doi, T. and Honiden, S. “Web Page Recommender System based on Folksonomy Mining”. In Proceedings of the 3rd International Conference on Information Technology: REFERENCES New Generations, 2006. [1] Lawrence, S. “Context in Web Search”, IEEE Data Engineering Bulletin, Volume 23, Number 3, pp. 25–32, [18] Li, J. and Zaiane, O. “Combining Usage, Content and 2000. Structure Data to Improve Web Site Recommendation”. In Proceedings of WebKDD-2004 Workshop on Web Mining [2] Chi, E. H. and Mytkowicz, T. “Understanding Navigability and Web Usage, 2004. of Social Tagging Systems”. In Proceedings of CHI'07, February, 2007. [19] Kazienko, P. and Kiwera, M. “Integration of Relational Databases and Web Site Content for Product and Page [3] CiteULike. http://www.citeulike.org, 2007. Recommendation”. In Proceedings of Database Engineering [4] Bibsonomy. http://www.bibsonomy.org, 2007. and Applications Symposium, 2004. [5] Golder, S. and Huberman, B. “The Structure of [20] Mathes, A. “Evaluating Collaborative filtering Collaborative Tagging Systems”. Journal of Information Recommender Systems”. ACM Transactions on Information Science, Vol. 32, No. 2, 198-208 (2006).. Systems, 2004. [6] Najjar, J., Wolpers, M., and Duval, E. L. “Attention [21] Hoerl, A. E., “Fitting Curves to Data”. In Chemical Metadata: Collection and Management”. In Proceedings of Business Handbook, Perry, J. (ed), Section 20, pp. 20--55, Workshop on Logging Traces of Web Activity: The 1954. Mechanics of Data Collection, Edinburgh, Scotland, May 23, [22] Lambshead, P. J. D., Brown, C. J., Ferrero, T. J, Hawkins, L. 2006. E., Smith, C. R and Mitchell, N. J. “Biodiversity of [7] Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A. nematode assemblages from region of the Clarion- “Information Diffusion through Blogspace”. Proceedings of Clipperton Fracture Zone”. In BioMed Central Ecology, the 13th International Conference on World Wide Web, 3:1, 2003. Pages: 491 – 501, 2004. [23] Doreian, P., Batagelj, V. and Ferligoj, A. “Generalized [8] Zamir, O. E., Korn, J. L., Fikes, A. B., Lawrence, S. R. Blockmodeling”, Cambridge University Press, 2005. “Personalization of placed content ordering in search [24] Strong, Jr., D. R., McCoy, E. D. and Rey, J. R. “Time and results”. US-Patent Application 20050240580. the Number of Herbivore Species: The Pests of Sugarcane”. [9] Haveliwala, T., Jeh, G. M., Kamvar, S. D. “Results based In Ecology, Vol. 58, No. 1. (Jan., 1977), pp. 167-175. personalization of advertisements in a search engine”. US [25] Flickr. http://www.flickr.com, 2007. Patent Application 20050222989. [26] Del.icio.us Blog. http://blog.del.icio.us. September, 2006. [10] Schimitz, C. "Small World Folksonomies: Clustering in Tri- Partite Hypergraphs". Univ. Kassel Tech Report, Nov/06. [27] Iamnitchi, A., Ripeanu, M. and Foster, I..“Small-World File- Sharing Communities”, In INFOCOM, 2004 [11] Watts D. J, Strogatz S. H. "Collective Dynamics of 'small- world' networks". In Nature, vol. 393, June, 1998.