Leveraging User-Interactions for Time-Aware Tag
                                    Recommendations
                                     Daniel Zoller                                                      Stephan Doerfel
                     University of Würzburg                                                             Micromata GmbH
       Data Mining and Information Retrieval (DMIR) Group                                                     Kassel
              zoller@informatik.uni-wuerzburg.de                                                     s.doerfel@micromata.de

                                   Christian Pölitz                                                      Andreas Hotho
                     University of Würzburg                                                       University of Würzburg
       Data Mining and Information Retrieval (DMIR) Group                           Data Mining and Information Retrieval (DMIR) Group
              poelitz@informatik.uni-wuerzburg.de                                                  L3S Research Center
                                                                                           hotho@informatik.uni-wuerzburg.de
ABSTRACT                                                                            taking all of a user’s previous tag assignments into account. Next to
For the popular task of tag recommendation, various (complex)                       its strong performance, the method has another significant advan-
approaches have been proposed. Recently however, research has                       tage over comparably effective methods: The recommendations can
focused on heuristics with low computational effort and particularly,               be computed online – no offline training is required. The latter can
a time-aware heuristic, called BLL, has been shown to compare well                  be a decisive factor for system operators, especially where tagging
to various state-of-the-art methods. Here, we follow up on these                    is only a secondary feature and expensive offline training seems out
results by presenting another time-aware approach leveraging user-                  of proportion due to the amount of required resources (hardware,
interaction data in an easily interpretable, on-the-fly computable                  time, and expertise).
approach that can successfully be combined with BLL.                                    In this paper, we follow the incentive to build such lightweight
    We investigate the influence of time as a parameter in that                     but effective methods. To that end, we use BLL and augment it
approach, and we demonstrate the effectiveness of the proposed                      with a time-aware heuristic relying on a new source of data, that is
method using two datasets from the popular public social tagging                    usually available in all tagging systems and can be exploited easily
system BibSonomy.                                                                   and on the fly: the interactions of users with the system. We exploit
                                                                                    the context of requested pages in a tagging system to generate
KEYWORDS                                                                            tags as recommendation candidates. We demonstrate that for the
                                                                                    selection of those requests, time is a critical factor. We evaluate
tag recommendation; user behavior; time-aware
                                                                                    the success of our approach on a dataset of the public real-world
                                                                                    tagging system BibSonomy.
1     INTRODUCTION                                                                  Contributions: The main contributions of this investigation are:
Tagging allows users to easily organize and share resources online.
Users add keywords, called “tags”, to resources and store them                          (1) We describe a heuristic to harvest tag recommendation
together, thus enabling themselves and others to retrieve these                             candidates from user interactions with the system before
resources using those tags as cues. Tagging is the basis of social                          posting a resource.
bookmarking systems, such as Delicious,1 BibSonomy,2 or Flickr.3                        (2) We test two variants of a tag recommender, that utilize the
However, it has also long found its way as a secondary feature into                         extracted recommendation candidates, one using a fix num-
many applications, like web shops, wikis, blogs, or libraries.                              ber of the most recent interactions and one using specific
   The task of recommending tags has proven to be a fruitful line of                        time-windows over which interactions are considered.
research. Many approaches (see Section 3) utilize complex or com-                       (3) We evaluate the overall gain in performance for recom-
putationally costly models, often relying on the full data collected                        mendations of the interaction-based heuristics, as well as
in the system. However, [10] already showed that simple and fast                            a hybridized version in combination with BLL.
popularity-based heuristics can achieve comparably good results
as more difficult and more expensive state-of-the-art methods.                      Our results demonstrate that the time-aware approach provides
   Recently, [19] confirmed this tendency in a large-scale experi-                  better suggestions than the variant using a fixed number of previ-
ment on multiple datasets, using a time-aware heuristic that trans-                 ous interactions. Furthermore, a hybrid combining the time-aware,
fers the base-level learning (BLL) equation to social tagging. BLL                  interaction-based recommender with BLL outperforms plain BLL.
scores a tag based on the time that has passed since it was last used,              Moreover (like plain BLL), our proposed recommender is indepen-
                                                                                    dent of the tagged resources (making no use of their contents) and
1 https://del.icio.us (for storing web links)
2 https://www.bibsonomy.org (for storing web links and publications)                respects the requirement of easy, online computability. We expect
3 https://www.flickr.com (for storing images)                                       our approach to be relevant for the tag recommender community,
                                                                                    as well as for operators of web systems who want to support their
RecTemp Workshop @ ACM RecSys ’17, Como, Italy                                      users with tag recommendations without spending effort and re-
Copyright © 2017 for this paper by its authors. Copying permitted for private and   sources on costly optimization procedures.
academic purposes.
RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy                Daniel Zoller, Stephan Doerfel, Christian Pölitz, and Andreas Hotho


   The remainder of this paper is structured as follows: First, we           much) smaller than the complete list of a user’s previously used
introduce our recommender (Section 2) that leverages user inter-             tags, let alone the complete historical posting data in the system.
action in a tagging system and the incorporation of the aspect of
time. Then, we discuss related work in Section 3. We describe the            2.2    Hybrid with BLL
dataset and experimental setup in Section 4. Next, we present our
                                                                             Both proposed variants of the tag recommender cannot provide
evaluation results for time-based and user interaction-based recom-
                                                                             results for new users that post resources immediately after regis-
menders in Section 5. Finally, we conclude the paper in Section 7,
                                                                             tration without interacting with the system. Also, it is not unusual
after discussing our findings in Section 6.
                                                                             that users do not interact with the system before they store a new
                                                                             resource. This fact also impedes the time-aware variant LRt . To
2     RECOMMENDERS
                                                                             compensate, we combine each of the interaction-based methods
Tag recommenders provide recommendations during the process                  with BLL into a hybrid recommender. BLL has been shown to out-
of posting a resource. When a user u is posting a resource r , the           perform other (time-dependent) approaches (see Section 3). It uses
recommender will provide a list of tags that u might want to assign          all tags previously used by the active user and computes a recency-
to r . For that task, we propose and describe a new time-aware,              based ranking on them: Let Tu be the set of tags previously used
interaction-driven recommender which exploits the previous inter-            by user u and timep(p) the timestamp when user u stored post p.
actions of the active user (the one to provide recommendations for)          Further, let Yt,u be the set of tag assignments for tag t of the user u
with the tagging system in this section. Lastly, we describe a hybrid        (i.e., we add a tuple (u, r , t) every time a user u annotates a resource
approach, combining our approach with BLL (Trattner et al. [19]).            r with tag t to the corresponding set), and time(y) be the timestamp
                                                                             of the tag assignment y ∈ Yt,u . The BLL-score of each tag in Tu
2.1      Time-Aware Interaction-Driven                                       is calculated as ln( y ∈Yt,u (timep(p) − time(y))−d ) and normalized
                                                                                                  Í
         Recommenders                                                        by the softmax function over all scores. We set d = 0.5 for our
In tagging systems, and more so in systems where tagging is a                evaluation – the setting which obtained the best results in [19].
secondary feature, a user’s interactions with the system are much                 Our interaction-based recommender approaches are combined
more diverse than just adding and tagging new resources. Still, tag          with BLL in the following way: First, the interaction-based method
recommendation approaches commonly focus only on the result                  is used to compute a ranked list L 1 of candidates (possibly of length
of these interactions (the tagged resources) to compute candidates           0). Similarly, BLL also yields a ranked list L 2 of candidate tags. From
to recommend. In this paper, we propose going beyond only the                L 2 , we remove all tags that also occur in L 1 . Then the two lists are
tagging activities and consider the context of all interactions that a       concatenated, such that the final ranked list of recommendations
user has with the system. To generate meaningful recommendations             contains the suggestions from L 1 followed by those from L 2 . We
for a resource r , we focus especially on the most recent interactions,      denote the two resulting hybrids by LRt + BLL and LRn + BLL.
as they are likely to belong to the same context as r . The rationale
behind the approach is, that by browsing and searching resources,            3     RELATED WORK
users reveal their current interests, which are useful to select fitting
                                                                             Recommending tags can serve various purposes, such as increasing
tags. A user’s interactions occur in the requests she sends to the
                                                                             the chance of getting a resource annotated, reminding a user what
system. Therefore, we call our recommender approach Last Requests
                                                                             a resource is about, and consolidating the vocabulary across users.
(LR). To select requests from which candidate tags are computed
                                                                             Furthermore, as Sood et al. [18] pointed out, tag recommenders
we distinguish two options: First, a time-window based variant
                                                                             lower the effort of annotation by changing the process from a
LRt : When user u stores a new resource r , recommendations are
                                                                             generation to a recognition task: rather than “inventing” tags, the
computed using all previous requests that u made during the time
                                                                             user only needs to select some of the recommended tags.
frame of length d immediately preceding the current posting of r .
                                                                                Since the emergence of social bookmarking, the topic of tag rec-
In the second variant, called LRn , only the last n interactions of u
                                                                             ommendations has raised considerable interest among researchers.
are considered. While the first variant is explicitly time-aware and
                                                                             As for all recommender domains, tag recommender algorithms can
thus more likely to capture interactions relevant in the context of
                                                                             roughly be classified into three classes: Content-based algorithms
the active post, the second variant is more broadly applicable, as
                                                                             use the content of resources, for instance to compute similarities
users do not necessarily interact with the system prior to posting a
                                                                             between items and to present items that are similar to the ones the
new resource (e.g., when using a posting tool, like a bookmarklet
                                                                             active user previously liked. Collaborative algorithms make use of
or browser add-on). In such a case, older but perhaps still relevant
                                                                             the relations between the users and the items, for instance by iden-
information could be used.
                                                                             tifying similar users and suggesting items similar users liked. The
   In both variants, recommendation candidates are derived from
                                                                             third class are algorithms that exploit both data sources, sometimes
each of the considered interactions. To yield a ranked list of rec-
                                                                             called hybrid recommenders. A major drawback of content-based
ommendations, the resulting tags are ordered by their frequency
                                                                             approaches is that the usability of a resource’s content depends
among the interactions.4 We describe methods for deriving recom-
                                                                             on the type of resource. For example, when the tagged resources
mendable tags from interactions in Section 4.3. Both recommenders
                                                                             are textual, using words from those resources (e.g., from the title,
can compute the candidates more efficiently than other recom-
                                                                             like Lipczak et al. [11] did) can be successful. However, the same
menders that leverage all previously used tags of a user, because
                                                                             is much harder when the resources are images. Moreover, even
they only compute frequencies on a set of tags which is (usually
                                                                             for textual resources, the implementation of the recommendation
4 In the case of ties, we return the tags in lexicographic order.            (e.g., the word selection strategy) is resource dependent. Thus, since
Leveraging User-Interactions for Time-Aware Tag Recommendations         RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy


our goal is to create a simple and highly versatile recommender,        no extra data structures nor precomputed values (see Section 2.2),
we focus only on the second class of recommender algorithms –           we use this approach as our baseline. Our method is similar to that
those that exploit the folksonomy structure between users, tags,        of Yin et al. [20], however, instead of using tags of the previous
and resources, which exists in any tagging system.                      posts, we use tags extracted from previous user interactions with
    An evaluation of collaborative algorithms, such as collaborative    the tagging system. We will show that the time-frame for collecting
filtering, the FolkRank algorithm [9], and simpler, popularity-based    such interactions is critical and that time-frames of less than a day
methods was performed in [10] on various datasets. FolkRank out-        are worth considering.
performed the other methods. However, the hybrid heuristic based
recommender, that combined users’ frequently used tags with tags        4     EXPERIMENTAL SETUP
that were frequently used to annotate the resource, was second.         In this section, we introduce the tagging system BibSonomy, of
Rendle and Schmidt-Thieme [16] produced recommendations with            which we use data for our study. We further describe the datasets
a statistical method based on factor models. They factorized the        with all preprocessing steps, the experiments and their evaluation.
folksonomy structure to find latent interactions between users, re-
sources and tags. Using a variant of the stochastic gradient descent    4.1     BibSonomy
algorithm, the authors optimized an adaptation of the Bayesian
                                                                        BibSonomy [1] allows users to collect references to (scientific) pub-
Personal Ranking criterion [15]. Seitlinger et al. [17] proposed an
                                                                        lications and bookmarks to websites, and to annotate these with
approach that simulates human category learning in a three-layer
                                                                        arbitrary keywords, so called tags. While entering the metadata
connectionist network. In the input layer, Latent Dirichlet Allo-
                                                                        of a resource in a form, the user can also enter tags, which she
cation is used to characterize the resource (and user). [14] intro-
                                                                        can later use for retrieval. The system assists the user with a tag
duced a slightly different folksonomy graph model in which edges
                                                                        autocompletion relying on her previously used tags. Next to the
are weighted and directed. On the resulting graph, PageRank is
                                                                        possibility of filtering resources by tag and/or user, users can find
used to produce a ranking of tags. [12] proposed ‘TagRank’, a vari-
                                                                        new interesting resources through a full text search. Additionally,
ant of topic-sensitive PageRank upon a tag-tag correlation graph
                                                                        users can form groups in which they can share posts and litera-
which they integrate into a hybrid with collaborative filtering and
                                                                        ture. On group pages, all group members’ resources are displayed.
popularity-based algorithms. The selection of the algorithms for
                                                                        Overview pages for websites and publications enable users to see
the hybrid is guided by a greedy algorithm. A drawback of all the
                                                                        who else bookmarked a specific resource and the tags they used
presented algorithms is their reliance on complex methodology that
                                                                        to describe the resource. Detail pages for publications show the
uses the full corpus of folksonomy data to learn a recommendation
                                                                        metadata that the user who saved the resource had entered for the
model. While they are suitable approaches to boost performance,
                                                                        publication. While browsing the system, a user can copy resources
they also require a lot of effort in terms of additional computation
                                                                        of other users into her own collection. Another feature allows users
time, hardware, implementation (e.g., additional data structures,
                                                                        to group tags to concepts (e.g., the tags "time" and "tag" to the con-
methods to update the trained models), and expertise. Due to the
                                                                        cept "recsys"). These concepts can be used for retrieving resources.
fact that a folksonomy changes over time, the learned models must
                                                                        BibSonomy is a popular target for spammers, that is, for users who
be updated regularly to fit the current data.
                                                                        store links to advertisements to promote their visibility. For that
    In [21], the authors introduce GIRP, a temporal tag usage pattern
                                                                        reason, users are classified by a learning algorithm and manually
model. It uses an exponential function that considers the first- and
                                                                        by the system’s administrators. For our analysis, we only used data
last-time usage of a tag. A short-term interests model is proposed
                                                                        generated by users that were not marked as spammers.
in [20], recommending the most popular tags of users based on re-
cent data, that is, data from a time-window of fixed length (one day
                                                                        4.2     Datasets
or higher). It is found that a window of 30 days works best on the
overall BibSonomy dataset. Recently, Trattner et al. [19] presented     Our experiments rely on two types of data gathered from the real-
a comprehensive study of various tag recommender strategies, in-        world tagging system BibSonomy: posts and user interactions. The
cluding their own development based on a model of human memory          latter type of data is rarely published – due to privacy concerns.
(BLL). In contrast to GIRP [21], BLL models the temporal tag usage      However, BibSonomy makes such data available to researchers
using a power function rather than an exponential function (see Sec-    in the form of collected HTTP-request server logs.5 Thus, at the
tion 2.2). They compare BLL with other methods (including several       moment, BibSonomy is the only source enabling the analyses pre-
of those mentioned above) and find that BLL performs better than        sented here. The methodology, however, is transferable to other
the time-dependent algorithm GIRP and other methods based on            tagging systems.
matrix factorization. Only more computationally expensive models        Request Log Data: The request log data contains every web re-
achieve a higher F-score on the evaluated datasets. Most of these       quest any user made to the tagging system. We removed all non-
models extend basic models by re-ranking the tag candidates by          human requests like redirects to other pages, or requests by bots or
the semantic context of the resource.                                   other applications (using the user agent information). Also, we only
    In this work, we assume the perspective of a tagging system         considered requests to HTML sites and excluded system pages (e.g.,
operator or, respectively, the operator of a system that includes       the login page). We used two different time frames for the evalu-
tagging as a secondary feature. We aim at supporting the tagging        ation: (i) from 2006-01-01 (the start date of BibSonomy) through
process with as little cost as possible while still delivering good     2011-12-31 (a dataset already used in behavioral analyses in previ-
results. Following the strong results of BLL in [19], and given its     ous work [5]) and (ii) a more recent share, ranging from 2014-07-01
low computational effort and the convenient fact that it requires       5 http://www.kde.cs.uni-kassel.de/bibsonomy/dumps
RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy                           Daniel Zoller, Stephan Doerfel, Christian Pölitz, and Andreas Hotho

Table 1: Statistics about the content datasets, where |U | is the                       External Referer: When users use external engines for searching
number of users, |P | the number of posts, |R| the number of                            and click on results linking to BibSonomy, the external source is
resources, |T | the number of tags and |Y | the number of tag                           logged as referer in the request logs. For candidate extraction, we
assignments.                                                                            tokenize the value of the ’q’ url parameter (because it is the most
                                                                                        commonly used parameter by search engines) of these requests and
      Dataset        |U |       |P |         |R|         |T |        |Y |
                                                                                        remove stop words.
      Bib1         3,674      180,807     162,364      74,634     672,249               Tags of a copied post: While browsing by tags or searching, the
      Bib2         1,912       53,098      46,441      33,937     207,709               system presents the user with the resources that match her entered
                                                                                        query. Next to every resource, the user can click on a copy button.
through 2016-06-30. For the remainder of this paper, we refer to the                    This click is recorded in the request logs. We extract the tags of the
older dataset as Bib1 and to the newer dataset as Bib2 . The older                      copied resource to represent this type of request.
dataset contains 2,074,182 and the newer 246,472 requests.                                 While some of the request types (e.g., concepts) are specific to
Content Data: The second type of data is the data generated by                          BibSonomy, most of them are usually present in a tagging sys-
the users of the system by annotating resources with tags. We split                     tem, representing search options as well as the typical navigation
the content dataset of BibSonomy into the same two time frames                          paradigm in a folksonomy.
as the interaction data. We applied several cleaning steps to each
dataset. First, we removed system tags (like myown or imported).                        4.4    Evaluation
To filter imports, we deleted all posts of a user that shared a posting
                                                                                        For tuning the parameters t and n of the two recommender heuris-
date with another post.6 The remaining tags were normalized by
                                                                                        tics, LRt and LRn , we split each of the two datasets into a validation
decapitalizing and removing all non-alphanumeric characters from
                                                                                        and a test set: each part contains 50 % of the posts. We use the
the tag string. We did not prune our datasets using a p-core, to avoid
                                                                                        validation set to determine the best parameters t and n. Then we
biases to the results (see Section 4.4). Furthermore, the goal of this
                                                                                        use the test to evaluate a hybrid with BLL (cf. Section 2.2).
research is to produce broadly applicable recommender strategies,
                                                                                           The choice of the evaluation setup often has a strong influence
but restricting the dataset to only its dense part would neglect new
                                                                                        on the experiments. For example, Cremonesi et al. [3] observed
users, rare tags and rare resources, thus providing an incomplete
                                                                                        that different sampling strategies yield different outcomes, while
impression on the overall performance. The statistics for the two
                                                                                        Doerfel et al. [4] showed that different restrictions on the dataset can
cleaned content datasets can be found in Table 1.
                                                                                        also lead to different (contradictory) results. Both suggest that the
                                                                                        scenario should be selected such that it resembles reality as much
4.3     Extraction of Tag Recommendation
                                                                                        as possible and that choices should be based on the use case rather
        Candidates from User Interaction                                                than on issues like sparsity. Therefore, we adapt the rating-based
While browsing in a tagging system, the user queries different page                     temporal leave-one-out method introduced in [2] to our scenario. In
types. For extracting tag recommendation candidates from requests                       our experiments, we consider each post as a test post. Moreover, we
to BibSonomy, we are using the following methods for the different                      ensure that the algorithms only use data from before the creation of
page types. After the extraction, we also normalized the extracted                      the test posts. More specifically, for each post p, we do the following:
tags as described in the previous section.                                              We select all posts (and requests) that have been created before p
Tag Pages: In BibSonomy, users can restrict the global collection of                    and use it to compute recommendations for p based on the user
resources or the collections of other users, groups and search results                  and resource of p. Then we compare the recommended tags to the
by tags on a separate tag page for the corresponding entity. All                        actual tags of p and evaluate the number of correctly predicted tags.
pages also support to filter with more than one tag. We represented                     This scenario is the most realistic offline evaluation scenario as it
a request to one of the tag pages with the specified tag(s).                            considers each occasion for recommendations and uses only data
User Pages: For user pages, we extracted the user’s tags that she                       resembling exactly the state of the system at the time the test post
used before the request was made for her own posts. We only                             was actually created.
considered user pages where the logged-in user requested a user                         Metric: We use the standard information performance metric F-
page of another user. This is the same representation that [13] used                    score for measuring the quality of the recommendations [7]. Online
for their analysis.                                                                     systems usually present users with only a limited number of tag
Resource Pages: Also, like [13], we represent a publication or                          recommendations while saving a new resource (often five tags). For
website overview page with the tags that any user used to describe                      that reason, we report F@5, that is, the F-score computed for the
the requested resource. We restrict the tags extracted from a details                   set of the first five suggested tags from the ranked list of recommen-
page to the page owner’s tags.                                                          dations (cf. Section 2.2). The parameters t and n are selected based
Concept Pages: Users can request concept pages for users, groups,                       on the best F@5 score found in experiments on the validation sets.
or globally, showing resources tagged with the keywords the user,                       Significance: To test the obtained results for statistically signifi-
group or all users defined as subtags of the concept. We added all                      cant differences, we use the Wilcoxon signed-rank test. Since we
subtags of the concept to the set of considered tag candidates.                         consider a large number of posts for the evaluation, we compute
Search Pages: We represent a search request to BibSonomy with                           two versions – considering either the posts or the users as the popu-
the terms of the search query after removing stop words using a                         lation. For the latter case, we averaged the obtained F@5 scores over
multi-language stop word list.                                                          all posts per user in the test set, and we conduct the significance
6 Furthermore, we removed user accounts that are used by libraries, like DBLP, from     test based on these averages. To indicate statistical significance, we
the two BibSonomy datasets.                                                             use the symbol * after a reported F-score when the user-based test
Leveraging User-Interactions for Time-Aware Tag Recommendations                             RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy


indicates significance, and we use + for the post-based test, in both                       Table 2: F-scores for LRn and LRt with their corresponding
cases using the α-level of 0.01.                                                            best configurations N and T . For comparison both BLL and
                                                                                            the hybrid recommender are reported for the subset. We also
5     RESULTS                                                                               report the number of considered users |U | and posts |P |. Sym-
                                                                                            bols * and + indicate a statistically significant difference (see
In this section, we present the results of our experiments. In the first
                                                                                            Section 4.4 for details).
set of experiments, we use the validation datasets to tune our two
approaches LRt and LRn , using various settings for t and n, thus                                 (a) Results for LRn . We report the significant difference of BLL com-
                                                                                                  pared to both other methods.
including different sets of user interactions. In all these experiments,
we consider only those posts where the situation was suitable for                                                               Bib1 (N=6)                          Bib2 (N=7)


                                                                                                                           |U |


                                                                                                                                                                |U |
the respective approach, that is, where interactions with the system                                                               F@5                                  F@5


                                                                                                                           |P |


                                                                                                                                                                |P |
had been recorded. Thus, the number of considered posts for which
                                                                                                    BLL                                 0.27*+/ *+                           0.32*+/ *+
an approach is tested varies from recommender type and setting.


                                                                                                                          62,541


                                                                                                                                                                17,788
                                                                                                                          1,998
                                                                                                    LRn=N                               0.13                                 0.10


                                                                                                                                                                509
On the same subsets of posts, we also evaluate BLL and the hybrid
with BLL for comparison. Eventually, in Section 5.2, we use the                                     LRn=N + BLL                         0.16                                 0.17
test datasets to evaluate the impact of our approach on the overall
performance, that is, we evaluate the recommender strategies using                                   (b) Results for LRt . We report the significant difference of
                                                                                                     LRt and LRt + BLL compared to BLL.
all posts without any restriction.
                                                                                                                                   Bib1 (T=1m)                   Bib2 (T=1m)
5.1      Time-Aware Interaction-Driven


                                                                                                                               |U |


                                                                                                                                                                 |U |
                                                                                                                                        F@5                             F@5


                                                                                                                               |P |


                                                                                                                                                                 |P |
         Recommenders                                                                                      BLL                             0.31                              0.35


                                                                                                                              6,974


                                                                                                                                                                910
                                                                                                                                                                163
First, we report the results for the recommenders LRt and LRn (see                                                                         0.32*+


                                                                                                                              927
                                                                                                           LRt =T                                                            0.33
Section 2.1). We report results averaged over all posts where the re-
                                                                                                           LRt =T + BLL                    0.42*+                            0.45+
spective heuristic was applicable (i.e., where there were observable
interactions), and we compare to BLL on the same set of posts.
   In our first evaluation, we vary the parameter n – the number of                                                        recommender             BLL        hybrid
                                                                                                          0.425
included previous interactions – of the LRn recommender from 1 to                                         0.400
10 on the validation set and report the results of the best parameter                                     0.375
                                                                                                    F@5


                                                                                                          0.350
in Table 2a. We find that on Bib1 and Bib2 similar configurations                                         0.325
(n = 6 for Bib1 and n = 7 for Bib2 ) worked best for LRn . Further,                                       0.300
                                                                                                                 1m           2m                3m                 4m               5m
we can observe that LRn alone is clearly inferior to BLL in terms of                                          6,974/927   9,718/1,142      11,527/1,253       12,937/1,322     14,066/1,374
F@5. Combining results by concatenating the lists of recommended                                                                         t (with |P| / |U|)
tags (LRn + BLL), as described in Section 2.2, can improve the F@5                          Figure 1: F-scores for BLL and the hybrid LRt + BLL on the
score, but still cannot reach that of plain BLL. Our hypothesis is that                     test set of Bib1 . The time-window t ranges from one to five
the last requests are too far in the past and thus have no relevance                        minutes. For each t the number of posts and users in the
for the current post. Thus, the time of the considered interactions                         considered subset are reported.
is a critical factor.
   Therefore, in Table 2b, we switch the mode of selecting requests                         from one minute to five minutes. We can observe, that BLL re-
to time-windows, using LRt . We vary the considered time-window                             mains roughly constant when t increases. On the other hand, the
for including requests from one minute to 30 days.7 Although the                            combination of LRt and BLL decreases from 0.42 to 0.35 when we
number of posts for which the heuristic is applicable grows with the                        increase the time-window t from one minute to five minutes, but
selected time-window length, we observe decreasing scores for LRt                           outperforms BLL for every considered time-window. For the dataset
on both datasets – more evidence for the above hypothesis. We find                          Bib2 , we find similar results, except that LRt decreases faster (figure
that using a time-window t of one minute yields the highest perfor-                         omitted due to space limitations).
mance on (the respective validation sets of) both datasets, Bib1 and
Bib2 . Other than before with LRn , the time-aware interaction-based                        5.2      Overall Performance
heuristic LRt yields F@5 scores comparable to BLL on both Bib1
                                                                                            In the previous section, we saw that the time-window based LRt
and Bib2 . When combining the two recommenders (LRt =T + BLL),
                                                                                            recommender achieved better results than BLL on those subsets
the scores improve significantly, by ten percentage points over
                                                                                            of the data where the respective heuristic was applicable. In this
plain BLL on both sets. The improvements obtained for Bib1 are
                                                                                            section, we evaluate the hybrid of LRt with BLL on the complete
significant according to the Wilcoxon signed-rank test, considering
                                                                                            test sets to get an impression of its overall impact as an improve-
both the users and the posts as entities. On Bib2 , the significance
                                                                                            ment over plain BLL. In the following, we combine LRt with those
of the hybrid’s improvement is confirmed when the posts are used
                                                                                            parameters that produced the best results on the validation sets. Re-
as entities in the test.
                                                                                            sults are given in Table 3. The combination with the request-based
   Figure 1 shows the trend of the F@5 scores calculated on BLL
                                                                                            LRt recommender improves the recommendation result by about
compared to the hybrid LRt + BLL for the time-windows ranging
                                                                                            three per cent on the Bib1 test set and one per cent on the Bib2 test
7 30 days is the setting for which Yin et al. [20] report the best results on the overall   set. A Wilcoxon signed-rank test, conducted on the average F-score
dataset for their approach based in previous posts.                                         of each user, indicates that the difference is significant for Bib1 .
RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy                 Daniel Zoller, Stephan Doerfel, Christian Pölitz, and Andreas Hotho

Table 3: F@5-scores of hybrid recommender and BLL on the                      Computational Cost: In contrast to many other methods (cf. Sec-
full BibSonomy test sets. Symbols * and + indicate a statisti-                tion 3), our heuristic does not require large user profiles, as it draws
cally significant difference (see Section 4.4 for details).                   the recommendations only from the interactions in the tagging sys-
                                                                              tem directly preceding a new post. Moreover, the heuristic is easy
                                  Bib1           Bib2                         to implement into arbitrary tagging systems, and requires only data
                BLL               0.265          0.315                        collected from browsing activities. While, in our experiments, we
                LRt =1m + BLL     0.274*+        0.318+                       made use of the request logs to exploit interactions, in a production
                                                                              environment tags can be derived directly from the interactions and
Testing with all posts as entities, significance is confirmed for both        can be stored into a temporary cache. Thus, the recommendations
Bib1 and Bib2 . Thus, we can conclude that exploiting user inter-             can be computed directly without accessing additional data sources.
actions in very short time-windows immediately before posting a               Relying solely on counting occurrences and restrictions on small
resource can boost the performance of the already well-performing             subsets of interactions, recommendations can be computed online
recommender algorithm BLL.                                                    without previous training. Consequently, they require only little
                                                                              effort, making them ideal candidates for systems where tagging is
                                                                              included as a secondary feature and for quickly prototyping a new
6    DISCUSSION AND LIMITATIONS                                               tagging system.
In our studies, we found that the time-window based recommender               Explainability: Finally, it is worth pointing out that, in our ap-
LRt provides better recommendations than the recommender LRn ,                proach, the choice of the recommended tags can be easily explained.
which uses the last requests. We also saw that in those situations            Explanations are suitable for increasing users’ acceptance of the
where it is applicable its results are comparable to the more complex         recommendations, particularly as the explanations reveals that no
algorithm BLL. Combining the time-window variant with BLL into                profiling of the user is necessary as only the few recent activities
a hybrid significantly improves the performance. In the following             are exploited.
we discuss several aspects and limitations of our study.
Timing: Overall, we could demonstrate the applicability of our                7    CONCLUSION
time-aware interaction-driven heuristic. We saw particularly that             In this work, we have proposed a time-aware, interaction-driven
short time spans immediately before the posting of a resource yield           tag recommendation heuristic, for the first time leveraging user
good recommendations. It seems that users store posts in bursts,              interaction in a tagging system beyond the publicly visible posting.
each representing different aspects of their (shifting) interests. Thus,      We have evaluated our approaches on data of the real-world tagging
relying on very recent tags is a reasonable approach.                         system BibSonomy. We have shown that time is a critical factor
Applicability: The number of posts and users for which the time-              and that particularly interactions immediately before a new post
window based recommender LRt can provide tag recommendations                  are a good source for recommendable tags.
is only a relatively small subset of the data. One reason for this               Finally, combining the time-aware approach with BLL leads to
phenomenon may be the fact that BibSonomy offers browser ex-                  a hybrid recommender that outperforms both individual compo-
tensions for posting resources, thus a useful means of posting to             nents. The result is an effective recommender system that is easy
BibSonomy without visiting the system first.                                  to implement and independent of the tagged resources and that
Generalizability of the Results: Since we could evaluate our                  requires no offline training. The approach is thus not only suitable
recommendation approach only on two BibSonomy datasets (due                   for dedicated tagging systems, but also for broader web systems in
to the unavailability of suitable data from other systems), it remains        which tagging is merely a secondary feature.
an open research question to see how it would perform in other                Future work: In this study, we have evaluated LRt and LRn with
tagging systems. Heckner et al. [8] found that users of tagging               fixed parameters (time-window length or number of considered
systems with different resources tend to tag for different reasons,           interactions) for all users. A chance for improving the results fur-
for example, Flickr (images) is used mostly for sharing, Delicious            ther would be an analysis of personalized parameters for each
(websites) mainly for retrieving resources from one’s own collection.         user or different user groups. It would also be conceivable to use
Since the resource types in BibSonomy and Delicious are similar               more refined methods to detect the user’s current context when
(references to documents; websites in both systems, publications              posting a new resource. For example, by looking at the requests,
only in BibSonomy), we hypothesize that they are often used for               it might be possible to distinguish situations where users do re-
similar purposes. Also, most page types (e.g., a user page) exist             search regarding one specific topic from situations where users just
in both systems. Thus, we would expect results of recommender                 “stumble” through the system changing their focus based on what
algorithm experiments conducted on Delicious to be qualitatively              they find on each page. Another topic for future work is the tag
similar to those found for BibSonomy. Another influence on the                candidate extraction. For this study, we extracted only the directly
usage of tagging systems is the user interface which varies from              requested tags or query terms from a request. In tagging systems,
system to system. For example, BibSonomy always links the entities            one could leverage the semantic context of such tags (arising from
of the folksonomy, while other systems may exclude some links or              the co-occurrence with other tags on the same resources). Finally,
place links on different positions within a page.                             instead of exploiting retrieval interaction on the level of individual
Transferability of the Approach: While the above mentioned is-                requests, one could attempt to identify sessions. These might yield
sues are limitations to the generalizability of the results in our study,     a more comprehensive understanding of the current user context
our methods of extracting tag recommendations from interactions               than considering individual requests independently.
can easily be adapted to other tagging systems.
Leveraging User-Interactions for Time-Aware Tag Recommendations                           RecTemp Workshop @ ACM RecSys ’17, August 2017, Como, Italy


REFERENCES                                                                                [11] Marek Lipczak, Yeming Hu, Yael Kollet, and Evangelos Milios. 2009. Tag Sources
 [1] Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff,                for Recommendation in Collaborative Tagging Systems, See [6], 157–172.
     Christoph Schmitz, and Gerd Stumme. 2010. The social bookmark and publication        [12] Feichao Ma, Wenqing Wang, and Zhihong Deng. 2013. TagRank: A new tag
     management system BibSonomy. The VLDB Journal 19, 6 (2010), 849–875. DOI:                 recommendation algorithm and recommender enhancement with data fusion
     https://doi.org/10.1007/s00778-010-0208-4                                                 techniques. In Social Media Retrieval and Mining, Shuigeng Zhou and Zhiang Wu
 [2] Robin Burke. 2010. Evaluating the Dynamic Properties of Recommendation                    (Eds.). Communications in Computer and Information Science, Vol. 387. Springer,
     Algorithms. In Proceedings of the Fourth ACM Conference on Recommender Sys-               Berlin/Heidelberg, 80–91. DOI:https://doi.org/10.1007/978-3-642-41629-3_7
     tems. ACM, New York, NY, USA, 225–228. DOI:https://doi.org/10.1145/1864708.          [13] Thomas Niebler, Martin Becker, Daniel Zoller, Stephan Doerfel, and Andreas
     1864753                                                                                   Hotho. 2016. FolkTrails: Interpreting Navigation Behavior in a Social Tagging
 [3] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of                   System. In Proceedings of the 25th ACM International on Conference on Information
     Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of                   and Knowledge Management. ACM, New York, NY, USA. DOI:https://doi.org/10.
     the Fourth ACM Conference on Recommender Systems. ACM, New York, NY, USA,                 1145/2983323.2983686
     39–46. DOI:https://doi.org/10.1145/1864708.1864721                                   [14] Maryam Ramezani. 2011. Improving graph-based approaches for personalized
 [4] Stephan Doerfel, Robert Jäschke, and Gerd Stumme. 2016. The Role of Cores in              tag recommendation. Journal of Emerging Technologies in Web Intelligence 3, 2
     Recommender Benchmarking for Social Bookmarking Systems. ACM Transac-                     (2011), 168–176. DOI:https://doi.org/10.4304/jetwi.3.2.168-176
     tions on Intelligent Systems and Technology 7, 3 (February 2016), 40:1–40:33. DOI:   [15] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Schmidt-Thieme
     https://doi.org/10.1145/2700485                                                           Lars. 2009. BPR: Bayesian personalized ranking from implicit feedback. In
 [5] Stephan Doerfel, Daniel Zoller, Philipp Singer, Thomas Niebler, Andreas Hotho,            Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. AUAI
     and Markus Strohmaier. 2016. What Users Actually do in a Social Tagging                   Press, Arlington, Virginia, United States, 452–461.
     System: A Study of User Behavior in BibSonomy. ACM Transactions on the Web           [16] Steffen Rendle and Lars Schmidt-Thieme. 2009. Factor Models for Tag Recom-
     10, 2 (2016), 14:1–14:32. DOI:https://doi.org/10.1145/2896821                             mendation in BibSonomy, See [6], 235–242.
 [6] Folke Eisterlehner, Andreas Hotho, and Robert Jäschke (Eds.). 2009. ECML PKDD        [17] Paul Seitlinger, Dominik Kowald, Christoph Trattner, and Tobias Ley. 2013. Rec-
     Discovery Challenge 2009 (DC09). CEUR-WS.org, Vol. 497.                                   ommending tags with a model of human categorization. In Proceedings of the
 [7] Asela Gunawardana and Guy Shani. 2009. A Survey of Accuracy Evaluation                    22nd International Conference on Conference on Information and Knowledge Man-
     Metrics of Recommendation Tasks. Journal of Machine Learning Research 10                  agement. ACM, New York, NY, USA, 2381–2386. DOI:https://doi.org/10.1145/
     (2009), 2935–2962.                                                                        2505515.2505625
 [8] M. Heckner, M. Heilemann, and C. Wolff. 2009. Personal Information Manage-           [18] Sanjay Sood, Sara Owsley, Kristian Hammond, and Larry Birnbaum. 2007. TagAs-
     ment vs. Resource Sharing: Towards a Model of Information Behaviour in Social             sist: Automatic Tag Suggestion for Blog Posts. In Proceedings of the 1st Interna-
     Tagging Systems. In Proceedings of the thrid AAAI Conference on Weblogs and               tional Conference on Weblogs and Social Media. Boulder, Colorado, USA.
     Social Media. San Jose, CA, USA.                                                     [19] Christoph Trattner, Dominik Kowald, Paul Seitlinger, Tobias Ley, and Simone
 [9] Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. 2006.                  Kopeinik. 2016. Modeling Activation Processes in Human Memory to Predict the
     Information Retrieval in Folksonomies: Search and Ranking. In The Semantic                Use of Tags in Social Bookmarking Systems. Journal of Web Science 2, 1 (2016),
     Web: Research and Applications: Third European Semantic Web Conference, ESWC              1–16. DOI:https://doi.org/10.1561/106.00000004
     2006 Budva, Montenegro, June 11-14, 2006 Proceedings (LNCS), York Sure and           [20] Dawei Yin, Liangjie Hong, Zhenzhen Xue, and Brian D. Davison. 2011. Temporal
     John Domingue (Eds.), Vol. 4011. Springer Berlin Heidelberg, Berlin/Heidelberg,           Dynamics of User Interests in Tagging Systems. In Proceedings of the 25th AAAI
     411–426. DOI:https://doi.org/10.1007/11762256_31                                          Conference on Artificial Intelligence. AAAI.
                                                                                          [21] Lei Zhang, Jian Tang, and Ming Zhang. 2012. Integrating Temporal Usage
[10] Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and
                                                                                               Pattern into Personalized Tag Prediction. In Proceedings of the 14th Asia-
     Gerd Stumme. 2008. Tag Recommendations in Social Bookmarking Sys-
                                                                                               Pacific International Conference on Web Technologies and Applications (AP-
     tems. AI Communications 21, 4 (2008), 231–247. DOI:https://doi.org/10.3233/
                                                                                               Web’12). Springer-Verlag, Berlin, Heidelberg, 354–365. DOI:https://doi.org/10.
     AIC-2008-0438
                                                                                               1007/978-3-642-29253-8_30