Location-aware online learning for top-k hashtag
                             recommendation

                                      Róbert Pálovics1,2 Péter Szalai1,2 Levente Kocsis1,4
                                      Júlia Pap1 Erzsébet Frigó1,3 András A. Benczúr1,3
                 1
                     Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
                       2
                         Technical University Budapest 3 Eötvös University Budapest 4 University of Szeged
                             {rpalovics, benczur, kocsis, pszalai, fbobee papjuli}@ilab.sztaki.hu

ABSTRACT                                                              We find that location and timing are the key factors with
In this paper we investigate the problem of recommend-             little contribution from personalized user interest. The lo-
ing Twitter hashtags for users with known GPS location,            cality of Twitter hashtag adoption in both spatial and tem-
learning online from the stream of geo-tagged tweets. Our          poral sense is observed among others by Kamath et al. [7].
method learns the relevance of regions in a geographical hi-       They state that “hashtags are a global phenomenon [. . . ]
erarchy, combined with the local popularity of the hashtag.        but distance between locations is a strong constraint on the
Unlike in typical collaborative filtering settings, trends and     adoption [. . . ] and follow a spray-and-diffuse pattern”.
geolocation turns out to be more important than personal-             We use a four-month collection of 400 million geotagged
ized user preferences. We evaluate in a time-aware setting,        Twitter messages detailed in [6]. We discard the text of
where evaluation is cumbersome by traditional measures,            the tweet messages and keep only the hashtags, the times-
since we have different top recommendations at different           tamp and the GPS coordinates. In our experiments we
times. We describe a time-aware framework based on in-             focus on the user location and the new hashtags that ap-
dividual item discounted gain.                                     pear in the message. As we have no information on which
                                                                   tweets are read by the users but we know the new hashtags
                                                                   they tweeted, we use the hashtag publishing information to
                                                                   measure user topic adoption. We consider a hashtag newly
1.     INTRODUCTION                                                adopted if we have not observed the given user-hashtag pair
   We investigate the problem of recommending Twitter hash-        before in the dataset. To guarantee at least one month for
tags for users, based on the temporal geolocation informa-         each user-hashtag pair without activity, we simply skip the
tion of both the users and the hashtags. Our aim is to rec-        first month of the stream of new hashtag usages.
ommend new hashtags, i.e. hashtags that the user has not              Some content may have obvious connection to certain lo-
used before. The recommendations are obtained by learn-            cations but others can have more widespread interest on
ing online from the stream of geo-tagged tweets. In our task       different levels such as language, continent, or even world-
the novel element is that neither users nor items (hashtags)       wide. Dealing with this, our models rely on the hierarchy of
are bound to one single location. Hashtags may in fact re-         regions from a global or continent-wide level down to a vil-
late to certain locations as well as be popular worldwide.         lage or city district to attribute the momentary popularity
Earlier results on recommendation in location-based social         of a hashtag to levels of locations. This hierarchical property
networks surveyed in e.g. [1, 13] combine spatial ratings for      of locations is surveyed also in [1]. We use the open hier-
non-spatial items, nonspatial ratings for spatial items, and       archical database of Global Administrative Areas (GADM,
spatial ratings for spatial items [10]. Our new results address    http://gadm.org). We mention that the metadata of tweets
the problem of the fuzzy relation of users and hashtags with       may contain not only GPS coordinates but also a place at-
locations.                                                         tribute that can contain the name and type of the place.
   Since hashtag usage is highly volatile, the problem calls for   However, we found the place attribute often ambiguous and
an online method. Whenever a user sends a geotagged tweet          less reliable.
with a hashtag he or she has not used earlier, we consider the        We use two models to recommend hashtags at a given time
event as a trigger for recommendation. We measure the ac-          and location, one based on the estimated probability of the
curacy of our methods in the online evaluation framework of        hashtag appearance based on its recency, and another based
[12] based on discounted cumulative gain (DCG) computed            on its temporal popularity. In both cases, our new method
individually for each event and averaged over time.                learns the importance of each node in the GADM tree. The
                                                                   final prediction arises as the weighted combination of the
                                                                   hashtag probabilities along the path of the GADM tree from
                                                                   the leaf location of the user up to the root.
                                                                      As baseline method, we use online matrix factorization
                                                                   [12]. Surprisingly, it turns out that matrix factorization per-
                                                                   forms much weaker than the distance based methods and
                                                                   contributes relatively little to the final prediction. This ob-
                                                                   servation justifies the importance of the temporal and geo-
Copyright held by the author(s).                                   graphic context of Twitter messages.
LocalRec’15, September 19, 2015, Vienna, Austria.
1.1    Related work                                                         Table 1: Properties of the cleansed dataset.
   Most of previous publications on geographic recommender
                                                                                           number of records       6,978,478
systems work with check-in data, where each of the items has
                                                                          number of unique user-hashtag pairs      2,993,183
a predefined static location. For brevity, we do not survey
these here. Hashtag recommendations are addressed in two                                     number of users       792,860
recent papers: Chen et al. [3] give methods for efficiently                              number of hashtags        268,489
maintaining a sliding window for time aware recommenda-                                  number of countries       49
tion, and Diaz et al. [5] introduce methods to compute ma-
trix factorization online. These results are orthogonal to our
exploitation of the location information.
                                                                       Most of the hashtags in the database are quite rare, thus
   Spatial statistics of hashtag adoption are analyzed by Ka-
                                                                    we use only the hashtags that appear more than 5 times.
math et al. [7]. Cheng et al. [4] give methods to geolocalize
                                                                    This way we exclude about 90% of the hashtags, but most of
tweets based on content. Mocanu et al. [11] use a data set
                                                                    the hashtag timeline remains. We also exclude the hashtags
similar to ours to analyze geographical properties like ho-
                                                                    that appear in the first month of the collection to recom-
mogeneity and seasonal patterns of language usage at scales
                                                                    mend newly spreading hashtags for the users. The proper-
ranging from country-level to city neighborhoods. Similar
                                                                    ties of the final cleansed dataset are summarized in Table 1.
to our use of the Global Administrative Areas, regions-of-
                                                                       We collected all 214,230 nodes from the GADM database,
interests partitioning is examined in [9] by applying k-means
                                                                    from which 190,315 are leaves. The depth of the tree is 6,
clustering to establish natural regions over Twitter data.
                                                                    and includes 5 levels from the GADM tree plus continent-
None of these papers exploit the results in recommender
                                                                    country relations. The hashtag time series data covered
systems.
                                                                    30,450 leaves from the tree.
   No other results use external data to define the hierarchy
of locations for recommendation tasks. Similar to our result,
in [6], GADM is used over the same Twitter data set, but            4.    MODELING
only for visualization purposes.
                                                                    4.1    Recommendation by location hierarchy
                                                                       In our recommendation model we use the location ` at
2.    ONLINE RECOMMENDATION AND                                     time t of the user. Here ` notes the leaf of the tree that is
      EVALUATION                                                    closest to the current GPS location of the user. First, we
   We use the online recommendation framework described             get the path in the tree from the root node to location `,
in [12], in which model training and evaluation happen si-          Path(`). Next, for a given hashtag, assume we have a rec-
multaneously, iterating over the dataset only once, in chrono-      ommendation method that yields scores s(h, n, t) for nodes
logical order. Whenever we see a new tweet, we assume that          n along Path(`). We will give two such methods in the next
the user becomes active and reveals its location to the recom-      subsections. In order to aggregate the individual recommen-
mender system. Next, we recommend hashtags of potential             dation at each GADM tree node, we propose the formula
interest for the user. The recommendation is online, hence                                       X
it depends on the context at the exact time instance of the                     r̂(u, h, `, t) =       wn · s(h, n, t),
tweet. If a user u tweets with hashtag h at time t in loca-                                    n∈Path(`)

tion `, our models give a score r̂(u, h0 , `, t) for each hashtag   where wn values are node specific weights. The weights wn
h0 seen so far, and recommend to u the k hashtags with the          are independent of the hashtags and characterize the area
largest values from those that u has not used before.               n only. We learn the weights by online gradient descent by
   The data is implicit: the events imply only that the user is     optimizing for RMSE. If we consider all positive instances
interested in a hashtag. In most of our models, we need nega-       and generate negative ones as described in Section 2, we will
tive instances as well for training. We use all hashtag usages      have sufficiently many implicit data to update the weights
as positive training instances and generate negative training       online as we read the sequence of events.
instances by selecting negRate random hashtags uniformly              In our experiments we also investigate models where we
at the time when a user first used a hashtag. We tested the         set all wn values constant, i.e. we do not learn the weights,
negRate parameter between 1 and 300.                                but simply sum all s(h, n, t) values along Path(`).
   We use the quality metric of [12] that we adopt to hash-
tag recommendation. If h is the new hashtag in the message          4.2    Temporal popularity
and the rank of h returned by the recommender system is                Given a predefined time discretization that we test be-
rank(h), then the discounted cumulative gain, DCG@k of              tween a minute and a day, for each location in the tree, we
this event is           1          if rank(h) ≤ k, and 0 oth-       compute the number of occurrences of the hashtag in the
               log2 (rank(h) + 1)
erwise. The overall evaluation of a model is the average            time interval at the given location. As it follows power-law
cumulative DCG@k.                                                   distribution, we use the logarithm of the temporal popular-
                                                                    ity values as node scores: s(h, n, t) = log(pop(h, n, t)), where
                                                                    pop(h, n, t) denotes the number of occurrences of hashtag h
3.    TWITTER AND GEOGRAPHICAL DATA                                 in node n in the time interval ending at time t.
   Dobos et al. [6] collected the dataset using the Twitter
open API by requesting geotagged tweets. We used the data           4.3    Hashtag recency
between February 1 and May 30, 2012 with February for                 Our next method estimates the chance of the appearance
training and observing distributions only, hence the online         of a hashtag by considering its most recent usage. The ad-
learning period lasts three months.                                 vantage of this method is that it is more sensitive to changes
              1e+06                                                                               0.4
             100,000
                                                                                                 0.35


                                                                    average cumulative DCG@100
N(IET = t)

              10,000
               1,000                                                                              0.3
                 100
                  10                                                                             0.25
                    1
                  0.1                                                                             0.2
                  1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06 1e+07
                                                                                                 0.15
                                        t (sec)
                                                                                                  0.1
                        Figure 1: Inter-event distribution.                                                     world              tree
                                                                                                 0.05           leaves             tree with learned node weights
                                                                                                                 countries
                                                                                                   0
in trends. While it may more aggressively overfit to single                                             2   4     6      8    10   12      14   16     18     20
events, overall it performs similar to and combines very well
                                                                                                                             time (days)
with the popularity based method. In Fig. 1, we investigate
the distribution of the time elapsed between the same hash-         Figure 2: Average cumulative DCG@100 for the tree based
tag appearing in tweets. This inter-event time distribution         popularity model and its baseline variants.
follows power law, in accordance with several earlier obser-
vations [2, 14, 8], P (τ = t) = (α − 1) · t−α , whence we easily
get
                                                 (1−α)
                                                                    5.2                             Popularity and recency based methods
                                               ∆t                      The temporal popularity based method using the GADM
     P (t < τ ≤ t + ∆t | τ > t) = 1 − 1 +                .   (1)
                                                t                   tree of Section 4.2 achieved best results by setting the time
                                                                    frame around 2 hours. In the recency based model of Sec-
For location sensitive prediction we maintain the last ap-          tion 4.3, the parameter ∆t had relatively little effect, we set
pearance of each hashtag for every node in the geolocation          ∆t = 12h. We compared the popularity and recency based
tree. We compute the estimate of (1) in each node by using          methods separately in Figs. 2 and 3, resp., by using differ-
the global measured value α = 1.2.                                  ent levels of the GADM tree and turning recency and node
4.4             Online matrix factorization                         weight learning on and off.
                                                                       In the figures, “world” denotes the methods that use global
  We apply stochastic gradient descent factorization for the        values only and do not take user geolocation into account,
user-hashtag matrix as in [12]. Batch stochastic gradient           while “leaves” and “countries” use the corresponding level of
descent iterates several times over the training set until con-     the tree. Note that country level popularity and recency
vergence. Online recommenders seem to be more restricted            performed very well while leaves worked only with recency
than those that may iterate over the data set several times.        not popularity. Note that using countries but no temporal
However, online matrix factorization proved to be superior          information at all performs the poorest.
to batch in [12], since it gives much more emphasis on recent          Best performance is obtained when using the whole tree
events.                                                             for recommendation by adding all recency values along the
                                                                    path corresponding to the current user location in the tree,
5.             EXPERIMENTS                                          marginally improving the country based results. However,
  In our graphs we show the average cumulative DCG for              by applying the gradient method to learn node specific weights
the first three weeks, by when all of our methods reach stable      as in Section 4.1, we could achieve significantly better results
performance. Here the cumulative average corresponds to             for recency but not for popularity. By using the recency
cumulative time average. We set k = 100 to compare our              based tree learning algorithm, we were able to focus on the
methods in detail. All methods show a slight performance            active and representative part of the tree. We achieved our
degradation, which is due to the fact that the number of            best results with lRate=0.0001 and negRate=4.
possible hashtags to recommend increase in time.
                                                                    5.3                             Online combination
5.1             Online matrix factorization                            In our final experiments we compared and combined our
   We used the online version of stochastic gradient descent        strongest methods. In Figure 4 we plotted the average cu-
(SGD) matrix factorization algorithm of [12]. We applied            mulative DCG@100 as the function of time for our best mod-
the mean square error with user and item regularization             els. Surprisingly, the tree based methods strongly outper-
terms of weight regRate as our objective function. Since            form online matrix factorization, while the best popularity
our data is implicit and contains only positive interactions,       based model overtook the best recency based method a little
we generated negative samples. Every time a user first posts        bit in the long run. Next, we considered the strongest one,
about a hashtag, we generate for her negRate hashtags that          the tree based popularity without node weight learning, to
she have not used in her past as described in Section 2. In         improve it by combining it with the best factor model and
all cases, we set negRate=99, learning rate lRate=0.4, and          recency recommender. We used the SGD based double layer
regRate=0.01.                                                       combination method introduced in [12] with mean squared
   As we show next, our tree based methods and their base-          error as objective function. In Figure 5 we show the results
line variants to exploit geographical information resulted in       of the combination. The popularity model can be improved
better performance in our experiments.                              by using the best recency method that uses the tree with
                              0.4                                                                        Table 2: Best performance methods and their combination,
                                                                                                         with relative improvement.
                             0.35
average cumlative DCG@100


                              0.3                                                                                                            DCG@100       DCG@10
                                                                                                              factorization                    0.206         0.180
                             0.25                                                                             recency w/ tree learning         0.355         0.323
                              0.2                                                                             popularity w/ tree learning      0.359         0.335
                                                                     world                                    popularity                       0.374         0.342
                             0.15                                    leaves                                   + recency                       (4.1% )        (2% )
                                                                     countries                                popularity                       0.381          0.35
                              0.1                                    countries without recency
                                                                     tree
                                                                                                              + recency + factor              (6.1% )       (4.2% )
                             0.05                                    tree with learned node weights

                               0
                                     2     4   6       8       10    12        14    16     18      20
                                                                                                         exploit the time and location context. Surprisingly, user
                                                                                                         personalization has little contribution to recommendation
                                                           time (days)                                   quality, hence our best methods apply in the user cold start
                                                                                                         setting as well.
Figure 3: Average cumulative DCG@100 curves of the re-
cency based methods.
                                                                                                         7.     REFERENCES
                                     0.4                                                                  [1] J. Bao, Y. Zheng, D. Wilkie, and M. F. Mokbel. A survey
average cumulative


                                    0.35                                                                      on recommendations in location-based social networks.
                                     0.3                                                                      ACM Trans. Intelligent Systems and Technology, 2013.
    DCG@100


                                    0.25                                                                  [2] A.-L. Barabási. The origin of bursts and heavy tails in
                                     0.2                                                  factor
                                                                                                              human dynamics. Nature, 435(7039):207–211, 2005.
                                    0.15                                                  pop             [3] C. Chen, H. Yin, J. Yao, and B. Cui. Terec: A temporal
                                     0.1                                                  recency             recommender system over tweet stream. Proceedings of the
                                    0.05                                                                      VLDB Endowment, 6(12):1254–1257, 2013.
                                           2   4   6       8        10    12    14    16     18     20    [4] Z. Cheng, J. Caverlee, and K. Lee. You are where you
                                                                time (days)                                   tweet: a content-based approach to geo-locating twitter
                                                                                                              users. In Proc. CIKM, pp. 759–768. ACM, 2010.
Figure 4: Best performances achieved with the factor, the                                                 [5] E. Diaz-Aviles, L. Drumond, L. Schmidt-Thieme, and
popularity, and the recency based recommenders.                                                               W. Nejdl. Real-time top-n recommendation in social
                                                                                                              streams. In Proc. RecSys, pp. 59–66. ACM, 2012.
                                                                                                          [6] L. Dobos, J. Szüle, T. Bodnár, T. Hanyecz, T. Sebők,
                              0.4                                                                             D. Kondor, Z. Kallus, J. Stéger, I. Csabai, and G. Vattay.
                                                                                                              A multi-terabyte relational database for geo-tagged social
                             0.38
average cumulative DCG@100


                                                                                                              network data. In Cognitive Infocommunications
                                                                                                              (CogInfoCom), 2013 IEEE 4th International Conference
                             0.36                                                                             on, pp. 289–294. IEEE, 2013.
                             0.34                                                                         [7] K. Y. Kamath, J. Caverlee, K. Lee, and Z. Cheng.
                                                                                                              Spatio-temporal dynamics of online memes: a study of
                             0.32                                                                             geo-tagged tweets. In Proc. WWW, pp. 667–678, 2013.
                                                                                                          [8] M. Kivelä and M. A. Porter. Estimating inter-event time
                              0.3                                                                             distributions from finite observation periods in
                                                                                                              communication networks. arXiv preprint arXiv:1412.8388,
                             0.28                          popularity                                         2014.
                                                           popularity + recency                           [9] R. Lee and K. Sumiya. Measuring geographical regularities
                             0.26
                                                           popularity + recency + factor                      of crowd behaviors for twitter-based geo-social event
                             0.24                                                                             detection. In Proceedings of the 2nd ACM SIGSPATIAL
                                                                                                              international workshop on location based social networks,
                                     2     4   6   8        10       12        14    16    18       20
                                                                                                              pp. 1–10. ACM, 2010.
                                                           time (days)                                   [10] J. J. Levandoski, M. Sarwat, A. Eldawy, and M. F. Mokbel.
                                                                                                              Lars: A location-aware recommender system. In Proc.
        Figure 5: Combination of the best three different models.                                             ICDE, pp. 450–461. IEEE, 2012.
                                                                                                         [11] D. Mocanu, A. Baronchelli, N. Perra, B. Gonçalves,
                                                                                                              Q. Zhang, and A. Vespignani. The twitter of babel:
                                                                                                              Mapping world languages through microblogging platforms.
learned node weights. We could further improve our results
                                                                                                              PloS One, 8(4):e61981, 2013.
by including the factor model in our recommendation. In                                                  [12] R. Pálovics and A. A. Benczúr. Temporal influence over the
Table 2 we collected the overall performance of our best                                                      last.fm social network. Social Netw. Analys. Mining, 5(1):4,
methods and their combinations.                                                                               2015.
                                                                                                         [13] P. Symeonidis, D. Ntempos, and Y. Manolopoulos.
                                                                                                              Location-based social networks. In Recommender Systems
6.                             CONCLUSION AND FUTURE WORK                                                     for Location-based Social Networks, pp. 35–48. Springer,
  We gave online, location based learning methods for Twit-                                                   2014.
ter hashtag recommendation. Since hashtags are not directly                                              [14] A. Vazquez, B. Rácz, A. Lukács, and A.-L. Barabási.
bound to a location, may be geographically spread, and vary                                                   Impact of non-poissonian activity patterns on spreading
in popularity at different times, we designed methods that                                                    processes. Physical review letters, 98(15):158702, 2007.