Leveraging User-Interactions for Time-Aware Tag Recommendations Daniel Zoller Stephan Doerfel University of Würzburg Micromata GmbH Data Mining and Information Retrieval (DMIR) Group Kassel zoller@informatik.uni-wuerzburg.de s.doerfel@micromata.de Christian Pölitz Andreas Hotho University of Würzburg University of Würzburg Data Mining and Information Retrieval (DMIR) Group Data Mining and Information Retrieval (DMIR) Group poelitz@informatik.uni-wuerzburg.de L3S Research Center hotho@informatik.uni-wuerzburg.de ABSTRACT taking all of a user’s previous tag assignments into account. Next to For the popular task of tag recommendation, various (complex) its strong performance, the method has another significant advan- approaches have been proposed. Recently however, research has tage over comparably effective methods: The recommendations can focused on heuristics with low computational effort and particularly, be computed online – no offline training is required. The latter can a time-aware heuristic, called BLL, has been shown to compare well be a decisive factor for system operators, especially where tagging to various state-of-the-art methods. Here, we follow up on these is only a secondary feature and expensive offline training seems out results by presenting another time-aware approach leveraging user- of proportion due to the amount of required resources (hardware, interaction data in an easily interpretable, on-the-fly computable time, and expertise). approach that can successfully be combined with BLL. In this paper, we follow the incentive to build such lightweight We investigate the influence of time as a parameter in that but effective methods. To that end, we use BLL and augment it approach, and we demonstrate the effectiveness of the proposed with a time-aware heuristic relying on a new source of data, that is method using two datasets from the popular public social tagging usually available in all tagging systems and can be exploited easily system BibSonomy. and on the fly: the interactions of users with the system. We exploit the context of requested pages in a tagging system to generate KEYWORDS tags as recommendation candidates. We demonstrate that for the selection of those requests, time is a critical factor. We evaluate tag recommendation; user behavior; time-aware the success of our approach on a dataset of the public real-world tagging system BibSonomy. 1 INTRODUCTION Contributions: The main contributions of this investigation are: Tagging allows users to easily organize and share resources online. Users add keywords, called “tags”, to resources and store them (1) We describe a heuristic to harvest tag recommendation together, thus enabling themselves and others to retrieve these candidates from user interactions with the system before resources using those tags as cues. Tagging is the basis of social posting a resource. bookmarking systems, such as Delicious,1 BibSonomy,2 or Flickr.3 (2) We test two variants of a tag recommender, that utilize the However, it has also long found its way as a secondary feature into extracted recommendation candidates, one using a fix num- many applications, like web shops, wikis, blogs, or libraries. ber of the most recent interactions and one using specific The task of recommending tags has proven to be a fruitful line of time-windows over which interactions are considered. research. Many approaches (see Section 3) utilize complex or com- (3) We evaluate the overall gain in performance for recom- putationally costly models, often relying on the full data collected mendations of the interaction-based heuristics, as well as in the system. However, [10] already showed that simple and fast a hybridized version in combination with BLL. popularity-based heuristics can achieve comparably good results as more difficult and more expensive state-of-the-art methods. Our results demonstrate that the time-aware approach provides Recently, [19] confirmed this tendency in a large-scale experi- better suggestions than the variant using a fixed number of previ- ment on multiple datasets, using a time-aware heuristic that trans- ous interactions. Furthermore, a hybrid combining the time-aware, fers the base-level learning (BLL) equation to social tagging. BLL interaction-based recommender with BLL outperforms plain BLL. scores a tag based on the time that has passed since it was last used, Moreover (like plain BLL), our proposed recommender is indepen- dent of the tagged resources (making no use of their contents) and 1 https://del.icio.us (for storing web links) 2 https://www.bibsonomy.org (for storing web links and publications) respects the requirement of easy, online computability. We expect 3 https://www.flickr.com (for storing images) our approach to be relevant for the tag recommender community, as well as for operators of web systems who want to support their RecTemp Workshop @ ACM RecSys ’17, Como, Italy users with tag recommendations without spending effort and re- Copyright © 2017 for this paper by its authors. Copying permitted for private and sources on costly optimization procedures. academic purposes. RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy Daniel Zoller, Stephan Doerfel, Christian Pölitz, and Andreas Hotho The remainder of this paper is structured as follows: First, we much) smaller than the complete list of a user’s previously used introduce our recommender (Section 2) that leverages user inter- tags, let alone the complete historical posting data in the system. action in a tagging system and the incorporation of the aspect of time. Then, we discuss related work in Section 3. We describe the 2.2 Hybrid with BLL dataset and experimental setup in Section 4. Next, we present our Both proposed variants of the tag recommender cannot provide evaluation results for time-based and user interaction-based recom- results for new users that post resources immediately after regis- menders in Section 5. Finally, we conclude the paper in Section 7, tration without interacting with the system. Also, it is not unusual after discussing our findings in Section 6. that users do not interact with the system before they store a new resource. This fact also impedes the time-aware variant LRt . To 2 RECOMMENDERS compensate, we combine each of the interaction-based methods Tag recommenders provide recommendations during the process with BLL into a hybrid recommender. BLL has been shown to out- of posting a resource. When a user u is posting a resource r , the perform other (time-dependent) approaches (see Section 3). It uses recommender will provide a list of tags that u might want to assign all tags previously used by the active user and computes a recency- to r . For that task, we propose and describe a new time-aware, based ranking on them: Let Tu be the set of tags previously used interaction-driven recommender which exploits the previous inter- by user u and timep(p) the timestamp when user u stored post p. actions of the active user (the one to provide recommendations for) Further, let Yt,u be the set of tag assignments for tag t of the user u with the tagging system in this section. Lastly, we describe a hybrid (i.e., we add a tuple (u, r , t) every time a user u annotates a resource approach, combining our approach with BLL (Trattner et al. [19]). r with tag t to the corresponding set), and time(y) be the timestamp of the tag assignment y ∈ Yt,u . The BLL-score of each tag in Tu 2.1 Time-Aware Interaction-Driven is calculated as ln( y ∈Yt,u (timep(p) − time(y))−d ) and normalized Í Recommenders by the softmax function over all scores. We set d = 0.5 for our In tagging systems, and more so in systems where tagging is a evaluation – the setting which obtained the best results in [19]. secondary feature, a user’s interactions with the system are much Our interaction-based recommender approaches are combined more diverse than just adding and tagging new resources. Still, tag with BLL in the following way: First, the interaction-based method recommendation approaches commonly focus only on the result is used to compute a ranked list L 1 of candidates (possibly of length of these interactions (the tagged resources) to compute candidates 0). Similarly, BLL also yields a ranked list L 2 of candidate tags. From to recommend. In this paper, we propose going beyond only the L 2 , we remove all tags that also occur in L 1 . Then the two lists are tagging activities and consider the context of all interactions that a concatenated, such that the final ranked list of recommendations user has with the system. To generate meaningful recommendations contains the suggestions from L 1 followed by those from L 2 . We for a resource r , we focus especially on the most recent interactions, denote the two resulting hybrids by LRt + BLL and LRn + BLL. as they are likely to belong to the same context as r . The rationale behind the approach is, that by browsing and searching resources, 3 RELATED WORK users reveal their current interests, which are useful to select fitting Recommending tags can serve various purposes, such as increasing tags. A user’s interactions occur in the requests she sends to the the chance of getting a resource annotated, reminding a user what system. Therefore, we call our recommender approach Last Requests a resource is about, and consolidating the vocabulary across users. (LR). To select requests from which candidate tags are computed Furthermore, as Sood et al. [18] pointed out, tag recommenders we distinguish two options: First, a time-window based variant lower the effort of annotation by changing the process from a LRt : When user u stores a new resource r , recommendations are generation to a recognition task: rather than “inventing” tags, the computed using all previous requests that u made during the time user only needs to select some of the recommended tags. frame of length d immediately preceding the current posting of r . Since the emergence of social bookmarking, the topic of tag rec- In the second variant, called LRn , only the last n interactions of u ommendations has raised considerable interest among researchers. are considered. While the first variant is explicitly time-aware and As for all recommender domains, tag recommender algorithms can thus more likely to capture interactions relevant in the context of roughly be classified into three classes: Content-based algorithms the active post, the second variant is more broadly applicable, as use the content of resources, for instance to compute similarities users do not necessarily interact with the system prior to posting a between items and to present items that are similar to the ones the new resource (e.g., when using a posting tool, like a bookmarklet active user previously liked. Collaborative algorithms make use of or browser add-on). In such a case, older but perhaps still relevant the relations between the users and the items, for instance by iden- information could be used. tifying similar users and suggesting items similar users liked. The In both variants, recommendation candidates are derived from third class are algorithms that exploit both data sources, sometimes each of the considered interactions. To yield a ranked list of rec- called hybrid recommenders. A major drawback of content-based ommendations, the resulting tags are ordered by their frequency approaches is that the usability of a resource’s content depends among the interactions.4 We describe methods for deriving recom- on the type of resource. For example, when the tagged resources mendable tags from interactions in Section 4.3. Both recommenders are textual, using words from those resources (e.g., from the title, can compute the candidates more efficiently than other recom- like Lipczak et al. [11] did) can be successful. However, the same menders that leverage all previously used tags of a user, because is much harder when the resources are images. Moreover, even they only compute frequencies on a set of tags which is (usually for textual resources, the implementation of the recommendation 4 In the case of ties, we return the tags in lexicographic order. (e.g., the word selection strategy) is resource dependent. Thus, since Leveraging User-Interactions for Time-Aware Tag Recommendations RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy our goal is to create a simple and highly versatile recommender, no extra data structures nor precomputed values (see Section 2.2), we focus only on the second class of recommender algorithms – we use this approach as our baseline. Our method is similar to that those that exploit the folksonomy structure between users, tags, of Yin et al. [20], however, instead of using tags of the previous and resources, which exists in any tagging system. posts, we use tags extracted from previous user interactions with An evaluation of collaborative algorithms, such as collaborative the tagging system. We will show that the time-frame for collecting filtering, the FolkRank algorithm [9], and simpler, popularity-based such interactions is critical and that time-frames of less than a day methods was performed in [10] on various datasets. FolkRank out- are worth considering. performed the other methods. However, the hybrid heuristic based recommender, that combined users’ frequently used tags with tags 4 EXPERIMENTAL SETUP that were frequently used to annotate the resource, was second. In this section, we introduce the tagging system BibSonomy, of Rendle and Schmidt-Thieme [16] produced recommendations with which we use data for our study. We further describe the datasets a statistical method based on factor models. They factorized the with all preprocessing steps, the experiments and their evaluation. folksonomy structure to find latent interactions between users, re- sources and tags. Using a variant of the stochastic gradient descent 4.1 BibSonomy algorithm, the authors optimized an adaptation of the Bayesian BibSonomy [1] allows users to collect references to (scientific) pub- Personal Ranking criterion [15]. Seitlinger et al. [17] proposed an lications and bookmarks to websites, and to annotate these with approach that simulates human category learning in a three-layer arbitrary keywords, so called tags. While entering the metadata connectionist network. In the input layer, Latent Dirichlet Allo- of a resource in a form, the user can also enter tags, which she cation is used to characterize the resource (and user). [14] intro- can later use for retrieval. The system assists the user with a tag duced a slightly different folksonomy graph model in which edges autocompletion relying on her previously used tags. Next to the are weighted and directed. On the resulting graph, PageRank is possibility of filtering resources by tag and/or user, users can find used to produce a ranking of tags. [12] proposed ‘TagRank’, a vari- new interesting resources through a full text search. Additionally, ant of topic-sensitive PageRank upon a tag-tag correlation graph users can form groups in which they can share posts and litera- which they integrate into a hybrid with collaborative filtering and ture. On group pages, all group members’ resources are displayed. popularity-based algorithms. The selection of the algorithms for Overview pages for websites and publications enable users to see the hybrid is guided by a greedy algorithm. A drawback of all the who else bookmarked a specific resource and the tags they used presented algorithms is their reliance on complex methodology that to describe the resource. Detail pages for publications show the uses the full corpus of folksonomy data to learn a recommendation metadata that the user who saved the resource had entered for the model. While they are suitable approaches to boost performance, publication. While browsing the system, a user can copy resources they also require a lot of effort in terms of additional computation of other users into her own collection. Another feature allows users time, hardware, implementation (e.g., additional data structures, to group tags to concepts (e.g., the tags "time" and "tag" to the con- methods to update the trained models), and expertise. Due to the cept "recsys"). These concepts can be used for retrieving resources. fact that a folksonomy changes over time, the learned models must BibSonomy is a popular target for spammers, that is, for users who be updated regularly to fit the current data. store links to advertisements to promote their visibility. For that In [21], the authors introduce GIRP, a temporal tag usage pattern reason, users are classified by a learning algorithm and manually model. It uses an exponential function that considers the first- and by the system’s administrators. For our analysis, we only used data last-time usage of a tag. A short-term interests model is proposed generated by users that were not marked as spammers. in [20], recommending the most popular tags of users based on re- cent data, that is, data from a time-window of fixed length (one day 4.2 Datasets or higher). It is found that a window of 30 days works best on the overall BibSonomy dataset. Recently, Trattner et al. [19] presented Our experiments rely on two types of data gathered from the real- a comprehensive study of various tag recommender strategies, in- world tagging system BibSonomy: posts and user interactions. The cluding their own development based on a model of human memory latter type of data is rarely published – due to privacy concerns. (BLL). In contrast to GIRP [21], BLL models the temporal tag usage However, BibSonomy makes such data available to researchers using a power function rather than an exponential function (see Sec- in the form of collected HTTP-request server logs.5 Thus, at the tion 2.2). They compare BLL with other methods (including several moment, BibSonomy is the only source enabling the analyses pre- of those mentioned above) and find that BLL performs better than sented here. The methodology, however, is transferable to other the time-dependent algorithm GIRP and other methods based on tagging systems. matrix factorization. Only more computationally expensive models Request Log Data: The request log data contains every web re- achieve a higher F-score on the evaluated datasets. Most of these quest any user made to the tagging system. We removed all non- models extend basic models by re-ranking the tag candidates by human requests like redirects to other pages, or requests by bots or the semantic context of the resource. other applications (using the user agent information). Also, we only In this work, we assume the perspective of a tagging system considered requests to HTML sites and excluded system pages (e.g., operator or, respectively, the operator of a system that includes the login page). We used two different time frames for the evalu- tagging as a secondary feature. We aim at supporting the tagging ation: (i) from 2006-01-01 (the start date of BibSonomy) through process with as little cost as possible while still delivering good 2011-12-31 (a dataset already used in behavioral analyses in previ- results. Following the strong results of BLL in [19], and given its ous work [5]) and (ii) a more recent share, ranging from 2014-07-01 low computational effort and the convenient fact that it requires 5 http://www.kde.cs.uni-kassel.de/bibsonomy/dumps RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy Daniel Zoller, Stephan Doerfel, Christian Pölitz, and Andreas Hotho Table 1: Statistics about the content datasets, where |U | is the External Referer: When users use external engines for searching number of users, |P | the number of posts, |R| the number of and click on results linking to BibSonomy, the external source is resources, |T | the number of tags and |Y | the number of tag logged as referer in the request logs. For candidate extraction, we assignments. tokenize the value of the ’q’ url parameter (because it is the most commonly used parameter by search engines) of these requests and Dataset |U | |P | |R| |T | |Y | remove stop words. Bib1 3,674 180,807 162,364 74,634 672,249 Tags of a copied post: While browsing by tags or searching, the Bib2 1,912 53,098 46,441 33,937 207,709 system presents the user with the resources that match her entered query. Next to every resource, the user can click on a copy button. through 2016-06-30. For the remainder of this paper, we refer to the This click is recorded in the request logs. We extract the tags of the older dataset as Bib1 and to the newer dataset as Bib2 . The older copied resource to represent this type of request. dataset contains 2,074,182 and the newer 246,472 requests. While some of the request types (e.g., concepts) are specific to Content Data: The second type of data is the data generated by BibSonomy, most of them are usually present in a tagging sys- the users of the system by annotating resources with tags. We split tem, representing search options as well as the typical navigation the content dataset of BibSonomy into the same two time frames paradigm in a folksonomy. as the interaction data. We applied several cleaning steps to each dataset. First, we removed system tags (like myown or imported). 4.4 Evaluation To filter imports, we deleted all posts of a user that shared a posting For tuning the parameters t and n of the two recommender heuris- date with another post.6 The remaining tags were normalized by tics, LRt and LRn , we split each of the two datasets into a validation decapitalizing and removing all non-alphanumeric characters from and a test set: each part contains 50 % of the posts. We use the the tag string. We did not prune our datasets using a p-core, to avoid validation set to determine the best parameters t and n. Then we biases to the results (see Section 4.4). Furthermore, the goal of this use the test to evaluate a hybrid with BLL (cf. Section 2.2). research is to produce broadly applicable recommender strategies, The choice of the evaluation setup often has a strong influence but restricting the dataset to only its dense part would neglect new on the experiments. For example, Cremonesi et al. [3] observed users, rare tags and rare resources, thus providing an incomplete that different sampling strategies yield different outcomes, while impression on the overall performance. The statistics for the two Doerfel et al. [4] showed that different restrictions on the dataset can cleaned content datasets can be found in Table 1. also lead to different (contradictory) results. Both suggest that the scenario should be selected such that it resembles reality as much 4.3 Extraction of Tag Recommendation as possible and that choices should be based on the use case rather Candidates from User Interaction than on issues like sparsity. Therefore, we adapt the rating-based While browsing in a tagging system, the user queries different page temporal leave-one-out method introduced in [2] to our scenario. In types. For extracting tag recommendation candidates from requests our experiments, we consider each post as a test post. Moreover, we to BibSonomy, we are using the following methods for the different ensure that the algorithms only use data from before the creation of page types. After the extraction, we also normalized the extracted the test posts. More specifically, for each post p, we do the following: tags as described in the previous section. We select all posts (and requests) that have been created before p Tag Pages: In BibSonomy, users can restrict the global collection of and use it to compute recommendations for p based on the user resources or the collections of other users, groups and search results and resource of p. Then we compare the recommended tags to the by tags on a separate tag page for the corresponding entity. All actual tags of p and evaluate the number of correctly predicted tags. pages also support to filter with more than one tag. We represented This scenario is the most realistic offline evaluation scenario as it a request to one of the tag pages with the specified tag(s). considers each occasion for recommendations and uses only data User Pages: For user pages, we extracted the user’s tags that she resembling exactly the state of the system at the time the test post used before the request was made for her own posts. We only was actually created. considered user pages where the logged-in user requested a user Metric: We use the standard information performance metric F- page of another user. This is the same representation that [13] used score for measuring the quality of the recommendations [7]. Online for their analysis. systems usually present users with only a limited number of tag Resource Pages: Also, like [13], we represent a publication or recommendations while saving a new resource (often five tags). For website overview page with the tags that any user used to describe that reason, we report F@5, that is, the F-score computed for the the requested resource. We restrict the tags extracted from a details set of the first five suggested tags from the ranked list of recommen- page to the page owner’s tags. dations (cf. Section 2.2). The parameters t and n are selected based Concept Pages: Users can request concept pages for users, groups, on the best F@5 score found in experiments on the validation sets. or globally, showing resources tagged with the keywords the user, Significance: To test the obtained results for statistically signifi- group or all users defined as subtags of the concept. We added all cant differences, we use the Wilcoxon signed-rank test. Since we subtags of the concept to the set of considered tag candidates. consider a large number of posts for the evaluation, we compute Search Pages: We represent a search request to BibSonomy with two versions – considering either the posts or the users as the popu- the terms of the search query after removing stop words using a lation. For the latter case, we averaged the obtained F@5 scores over multi-language stop word list. all posts per user in the test set, and we conduct the significance 6 Furthermore, we removed user accounts that are used by libraries, like DBLP, from test based on these averages. To indicate statistical significance, we the two BibSonomy datasets. use the symbol * after a reported F-score when the user-based test Leveraging User-Interactions for Time-Aware Tag Recommendations RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy indicates significance, and we use + for the post-based test, in both Table 2: F-scores for LRn and LRt with their corresponding cases using the α-level of 0.01. best configurations N and T . For comparison both BLL and the hybrid recommender are reported for the subset. We also 5 RESULTS report the number of considered users |U | and posts |P |. Sym- bols * and + indicate a statistically significant difference (see In this section, we present the results of our experiments. In the first Section 4.4 for details). set of experiments, we use the validation datasets to tune our two approaches LRt and LRn , using various settings for t and n, thus (a) Results for LRn . We report the significant difference of BLL com- pared to both other methods. including different sets of user interactions. In all these experiments, we consider only those posts where the situation was suitable for Bib1 (N=6) Bib2 (N=7) |U | |U | the respective approach, that is, where interactions with the system F@5 F@5 |P | |P | had been recorded. Thus, the number of considered posts for which BLL 0.27*+/ *+ 0.32*+/ *+ an approach is tested varies from recommender type and setting. 62,541 17,788 1,998 LRn=N 0.13 0.10 509 On the same subsets of posts, we also evaluate BLL and the hybrid with BLL for comparison. Eventually, in Section 5.2, we use the LRn=N + BLL 0.16 0.17 test datasets to evaluate the impact of our approach on the overall performance, that is, we evaluate the recommender strategies using (b) Results for LRt . We report the significant difference of LRt and LRt + BLL compared to BLL. all posts without any restriction. Bib1 (T=1m) Bib2 (T=1m) 5.1 Time-Aware Interaction-Driven |U | |U | F@5 F@5 |P | |P | Recommenders BLL 0.31 0.35 6,974 910 163 First, we report the results for the recommenders LRt and LRn (see 0.32*+ 927 LRt =T 0.33 Section 2.1). We report results averaged over all posts where the re- LRt =T + BLL 0.42*+ 0.45+ spective heuristic was applicable (i.e., where there were observable interactions), and we compare to BLL on the same set of posts. In our first evaluation, we vary the parameter n – the number of recommender BLL hybrid 0.425 included previous interactions – of the LRn recommender from 1 to 0.400 10 on the validation set and report the results of the best parameter 0.375 F@5 0.350 in Table 2a. We find that on Bib1 and Bib2 similar configurations 0.325 (n = 6 for Bib1 and n = 7 for Bib2 ) worked best for LRn . Further, 0.300 1m 2m 3m 4m 5m we can observe that LRn alone is clearly inferior to BLL in terms of 6,974/927 9,718/1,142 11,527/1,253 12,937/1,322 14,066/1,374 F@5. Combining results by concatenating the lists of recommended t (with |P| / |U|) tags (LRn + BLL), as described in Section 2.2, can improve the F@5 Figure 1: F-scores for BLL and the hybrid LRt + BLL on the score, but still cannot reach that of plain BLL. Our hypothesis is that test set of Bib1 . The time-window t ranges from one to five the last requests are too far in the past and thus have no relevance minutes. For each t the number of posts and users in the for the current post. Thus, the time of the considered interactions considered subset are reported. is a critical factor. Therefore, in Table 2b, we switch the mode of selecting requests from one minute to five minutes. We can observe, that BLL re- to time-windows, using LRt . We vary the considered time-window mains roughly constant when t increases. On the other hand, the for including requests from one minute to 30 days.7 Although the combination of LRt and BLL decreases from 0.42 to 0.35 when we number of posts for which the heuristic is applicable grows with the increase the time-window t from one minute to five minutes, but selected time-window length, we observe decreasing scores for LRt outperforms BLL for every considered time-window. For the dataset on both datasets – more evidence for the above hypothesis. We find Bib2 , we find similar results, except that LRt decreases faster (figure that using a time-window t of one minute yields the highest perfor- omitted due to space limitations). mance on (the respective validation sets of) both datasets, Bib1 and Bib2 . Other than before with LRn , the time-aware interaction-based 5.2 Overall Performance heuristic LRt yields F@5 scores comparable to BLL on both Bib1 In the previous section, we saw that the time-window based LRt and Bib2 . When combining the two recommenders (LRt =T + BLL), recommender achieved better results than BLL on those subsets the scores improve significantly, by ten percentage points over of the data where the respective heuristic was applicable. In this plain BLL on both sets. The improvements obtained for Bib1 are section, we evaluate the hybrid of LRt with BLL on the complete significant according to the Wilcoxon signed-rank test, considering test sets to get an impression of its overall impact as an improve- both the users and the posts as entities. On Bib2 , the significance ment over plain BLL. In the following, we combine LRt with those of the hybrid’s improvement is confirmed when the posts are used parameters that produced the best results on the validation sets. Re- as entities in the test. sults are given in Table 3. The combination with the request-based Figure 1 shows the trend of the F@5 scores calculated on BLL LRt recommender improves the recommendation result by about compared to the hybrid LRt + BLL for the time-windows ranging three per cent on the Bib1 test set and one per cent on the Bib2 test 7 30 days is the setting for which Yin et al. [20] report the best results on the overall set. A Wilcoxon signed-rank test, conducted on the average F-score dataset for their approach based in previous posts. of each user, indicates that the difference is significant for Bib1 . RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy Daniel Zoller, Stephan Doerfel, Christian Pölitz, and Andreas Hotho Table 3: F@5-scores of hybrid recommender and BLL on the Computational Cost: In contrast to many other methods (cf. Sec- full BibSonomy test sets. Symbols * and + indicate a statisti- tion 3), our heuristic does not require large user profiles, as it draws cally significant difference (see Section 4.4 for details). the recommendations only from the interactions in the tagging sys- tem directly preceding a new post. Moreover, the heuristic is easy Bib1 Bib2 to implement into arbitrary tagging systems, and requires only data BLL 0.265 0.315 collected from browsing activities. While, in our experiments, we LRt =1m + BLL 0.274*+ 0.318+ made use of the request logs to exploit interactions, in a production environment tags can be derived directly from the interactions and Testing with all posts as entities, significance is confirmed for both can be stored into a temporary cache. Thus, the recommendations Bib1 and Bib2 . Thus, we can conclude that exploiting user inter- can be computed directly without accessing additional data sources. actions in very short time-windows immediately before posting a Relying solely on counting occurrences and restrictions on small resource can boost the performance of the already well-performing subsets of interactions, recommendations can be computed online recommender algorithm BLL. without previous training. Consequently, they require only little effort, making them ideal candidates for systems where tagging is included as a secondary feature and for quickly prototyping a new 6 DISCUSSION AND LIMITATIONS tagging system. In our studies, we found that the time-window based recommender Explainability: Finally, it is worth pointing out that, in our ap- LRt provides better recommendations than the recommender LRn , proach, the choice of the recommended tags can be easily explained. which uses the last requests. We also saw that in those situations Explanations are suitable for increasing users’ acceptance of the where it is applicable its results are comparable to the more complex recommendations, particularly as the explanations reveals that no algorithm BLL. Combining the time-window variant with BLL into profiling of the user is necessary as only the few recent activities a hybrid significantly improves the performance. In the following are exploited. we discuss several aspects and limitations of our study. Timing: Overall, we could demonstrate the applicability of our 7 CONCLUSION time-aware interaction-driven heuristic. We saw particularly that In this work, we have proposed a time-aware, interaction-driven short time spans immediately before the posting of a resource yield tag recommendation heuristic, for the first time leveraging user good recommendations. It seems that users store posts in bursts, interaction in a tagging system beyond the publicly visible posting. each representing different aspects of their (shifting) interests. Thus, We have evaluated our approaches on data of the real-world tagging relying on very recent tags is a reasonable approach. system BibSonomy. We have shown that time is a critical factor Applicability: The number of posts and users for which the time- and that particularly interactions immediately before a new post window based recommender LRt can provide tag recommendations are a good source for recommendable tags. is only a relatively small subset of the data. One reason for this Finally, combining the time-aware approach with BLL leads to phenomenon may be the fact that BibSonomy offers browser ex- a hybrid recommender that outperforms both individual compo- tensions for posting resources, thus a useful means of posting to nents. The result is an effective recommender system that is easy BibSonomy without visiting the system first. to implement and independent of the tagged resources and that Generalizability of the Results: Since we could evaluate our requires no offline training. The approach is thus not only suitable recommendation approach only on two BibSonomy datasets (due for dedicated tagging systems, but also for broader web systems in to the unavailability of suitable data from other systems), it remains which tagging is merely a secondary feature. an open research question to see how it would perform in other Future work: In this study, we have evaluated LRt and LRn with tagging systems. Heckner et al. [8] found that users of tagging fixed parameters (time-window length or number of considered systems with different resources tend to tag for different reasons, interactions) for all users. A chance for improving the results fur- for example, Flickr (images) is used mostly for sharing, Delicious ther would be an analysis of personalized parameters for each (websites) mainly for retrieving resources from one’s own collection. user or different user groups. It would also be conceivable to use Since the resource types in BibSonomy and Delicious are similar more refined methods to detect the user’s current context when (references to documents; websites in both systems, publications posting a new resource. For example, by looking at the requests, only in BibSonomy), we hypothesize that they are often used for it might be possible to distinguish situations where users do re- similar purposes. Also, most page types (e.g., a user page) exist search regarding one specific topic from situations where users just in both systems. Thus, we would expect results of recommender “stumble” through the system changing their focus based on what algorithm experiments conducted on Delicious to be qualitatively they find on each page. Another topic for future work is the tag similar to those found for BibSonomy. Another influence on the candidate extraction. For this study, we extracted only the directly usage of tagging systems is the user interface which varies from requested tags or query terms from a request. In tagging systems, system to system. For example, BibSonomy always links the entities one could leverage the semantic context of such tags (arising from of the folksonomy, while other systems may exclude some links or the co-occurrence with other tags on the same resources). Finally, place links on different positions within a page. instead of exploiting retrieval interaction on the level of individual Transferability of the Approach: While the above mentioned is- requests, one could attempt to identify sessions. These might yield sues are limitations to the generalizability of the results in our study, a more comprehensive understanding of the current user context our methods of extracting tag recommendations from interactions than considering individual requests independently. can easily be adapted to other tagging systems. Leveraging User-Interactions for Time-Aware Tag Recommendations RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy REFERENCES [11] Marek Lipczak, Yeming Hu, Yael Kollet, and Evangelos Milios. 2009. Tag Sources [1] Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, for Recommendation in Collaborative Tagging Systems, See [6], 157–172. Christoph Schmitz, and Gerd Stumme. 2010. The social bookmark and publication [12] Feichao Ma, Wenqing Wang, and Zhihong Deng. 2013. TagRank: A new tag management system BibSonomy. The VLDB Journal 19, 6 (2010), 849–875. DOI: recommendation algorithm and recommender enhancement with data fusion https://doi.org/10.1007/s00778-010-0208-4 techniques. In Social Media Retrieval and Mining, Shuigeng Zhou and Zhiang Wu [2] Robin Burke. 2010. Evaluating the Dynamic Properties of Recommendation (Eds.). Communications in Computer and Information Science, Vol. 387. Springer, Algorithms. In Proceedings of the Fourth ACM Conference on Recommender Sys- Berlin/Heidelberg, 80–91. DOI:https://doi.org/10.1007/978-3-642-41629-3_7 tems. ACM, New York, NY, USA, 225–228. DOI:https://doi.org/10.1145/1864708. [13] Thomas Niebler, Martin Becker, Daniel Zoller, Stephan Doerfel, and Andreas 1864753 Hotho. 2016. FolkTrails: Interpreting Navigation Behavior in a Social Tagging [3] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of System. In Proceedings of the 25th ACM International on Conference on Information Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of and Knowledge Management. ACM, New York, NY, USA. DOI:https://doi.org/10. the Fourth ACM Conference on Recommender Systems. ACM, New York, NY, USA, 1145/2983323.2983686 39–46. DOI:https://doi.org/10.1145/1864708.1864721 [14] Maryam Ramezani. 2011. Improving graph-based approaches for personalized [4] Stephan Doerfel, Robert Jäschke, and Gerd Stumme. 2016. The Role of Cores in tag recommendation. Journal of Emerging Technologies in Web Intelligence 3, 2 Recommender Benchmarking for Social Bookmarking Systems. ACM Transac- (2011), 168–176. DOI:https://doi.org/10.4304/jetwi.3.2.168-176 tions on Intelligent Systems and Technology 7, 3 (February 2016), 40:1–40:33. DOI: [15] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Schmidt-Thieme https://doi.org/10.1145/2700485 Lars. 2009. BPR: Bayesian personalized ranking from implicit feedback. In [5] Stephan Doerfel, Daniel Zoller, Philipp Singer, Thomas Niebler, Andreas Hotho, Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. AUAI and Markus Strohmaier. 2016. What Users Actually do in a Social Tagging Press, Arlington, Virginia, United States, 452–461. System: A Study of User Behavior in BibSonomy. ACM Transactions on the Web [16] Steffen Rendle and Lars Schmidt-Thieme. 2009. Factor Models for Tag Recom- 10, 2 (2016), 14:1–14:32. DOI:https://doi.org/10.1145/2896821 mendation in BibSonomy, See [6], 235–242. [6] Folke Eisterlehner, Andreas Hotho, and Robert Jäschke (Eds.). 2009. ECML PKDD [17] Paul Seitlinger, Dominik Kowald, Christoph Trattner, and Tobias Ley. 2013. Rec- Discovery Challenge 2009 (DC09). CEUR-WS.org, Vol. 497. ommending tags with a model of human categorization. In Proceedings of the [7] Asela Gunawardana and Guy Shani. 2009. A Survey of Accuracy Evaluation 22nd International Conference on Conference on Information and Knowledge Man- Metrics of Recommendation Tasks. Journal of Machine Learning Research 10 agement. ACM, New York, NY, USA, 2381–2386. DOI:https://doi.org/10.1145/ (2009), 2935–2962. 2505515.2505625 [8] M. Heckner, M. Heilemann, and C. Wolff. 2009. Personal Information Manage- [18] Sanjay Sood, Sara Owsley, Kristian Hammond, and Larry Birnbaum. 2007. TagAs- ment vs. Resource Sharing: Towards a Model of Information Behaviour in Social sist: Automatic Tag Suggestion for Blog Posts. In Proceedings of the 1st Interna- Tagging Systems. In Proceedings of the thrid AAAI Conference on Weblogs and tional Conference on Weblogs and Social Media. Boulder, Colorado, USA. Social Media. San Jose, CA, USA. [19] Christoph Trattner, Dominik Kowald, Paul Seitlinger, Tobias Ley, and Simone [9] Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. 2006. Kopeinik. 2016. Modeling Activation Processes in Human Memory to Predict the Information Retrieval in Folksonomies: Search and Ranking. In The Semantic Use of Tags in Social Bookmarking Systems. Journal of Web Science 2, 1 (2016), Web: Research and Applications: Third European Semantic Web Conference, ESWC 1–16. DOI:https://doi.org/10.1561/106.00000004 2006 Budva, Montenegro, June 11-14, 2006 Proceedings (LNCS), York Sure and [20] Dawei Yin, Liangjie Hong, Zhenzhen Xue, and Brian D. Davison. 2011. Temporal John Domingue (Eds.), Vol. 4011. Springer Berlin Heidelberg, Berlin/Heidelberg, Dynamics of User Interests in Tagging Systems. In Proceedings of the 25th AAAI 411–426. DOI:https://doi.org/10.1007/11762256_31 Conference on Artificial Intelligence. AAAI. [21] Lei Zhang, Jian Tang, and Ming Zhang. 2012. Integrating Temporal Usage [10] Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and Pattern into Personalized Tag Prediction. In Proceedings of the 14th Asia- Gerd Stumme. 2008. Tag Recommendations in Social Bookmarking Sys- Pacific International Conference on Web Technologies and Applications (AP- tems. AI Communications 21, 4 (2008), 231–247. DOI:https://doi.org/10.3233/ Web’12). Springer-Verlag, Berlin, Heidelberg, 354–365. DOI:https://doi.org/10. AIC-2008-0438 1007/978-3-642-29253-8_30