Location-aware online learning for top-k hashtag recommendation Róbert Pálovics1,2 Péter Szalai1,2 Levente Kocsis1,4 Júlia Pap1 Erzsébet Frigó1,3 András A. Benczúr1,3 1 Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI) 2 Technical University Budapest 3 Eötvös University Budapest 4 University of Szeged {rpalovics, benczur, kocsis, pszalai, fbobee papjuli}@ilab.sztaki.hu ABSTRACT We find that location and timing are the key factors with In this paper we investigate the problem of recommend- little contribution from personalized user interest. The lo- ing Twitter hashtags for users with known GPS location, cality of Twitter hashtag adoption in both spatial and tem- learning online from the stream of geo-tagged tweets. Our poral sense is observed among others by Kamath et al. [7]. method learns the relevance of regions in a geographical hi- They state that “hashtags are a global phenomenon [. . . ] erarchy, combined with the local popularity of the hashtag. but distance between locations is a strong constraint on the Unlike in typical collaborative filtering settings, trends and adoption [. . . ] and follow a spray-and-diffuse pattern”. geolocation turns out to be more important than personal- We use a four-month collection of 400 million geotagged ized user preferences. We evaluate in a time-aware setting, Twitter messages detailed in [6]. We discard the text of where evaluation is cumbersome by traditional measures, the tweet messages and keep only the hashtags, the times- since we have different top recommendations at different tamp and the GPS coordinates. In our experiments we times. We describe a time-aware framework based on in- focus on the user location and the new hashtags that ap- dividual item discounted gain. pear in the message. As we have no information on which tweets are read by the users but we know the new hashtags they tweeted, we use the hashtag publishing information to measure user topic adoption. We consider a hashtag newly 1. INTRODUCTION adopted if we have not observed the given user-hashtag pair We investigate the problem of recommending Twitter hash- before in the dataset. To guarantee at least one month for tags for users, based on the temporal geolocation informa- each user-hashtag pair without activity, we simply skip the tion of both the users and the hashtags. Our aim is to rec- first month of the stream of new hashtag usages. ommend new hashtags, i.e. hashtags that the user has not Some content may have obvious connection to certain lo- used before. The recommendations are obtained by learn- cations but others can have more widespread interest on ing online from the stream of geo-tagged tweets. In our task different levels such as language, continent, or even world- the novel element is that neither users nor items (hashtags) wide. Dealing with this, our models rely on the hierarchy of are bound to one single location. Hashtags may in fact re- regions from a global or continent-wide level down to a vil- late to certain locations as well as be popular worldwide. lage or city district to attribute the momentary popularity Earlier results on recommendation in location-based social of a hashtag to levels of locations. This hierarchical property networks surveyed in e.g. [1, 13] combine spatial ratings for of locations is surveyed also in [1]. We use the open hier- non-spatial items, nonspatial ratings for spatial items, and archical database of Global Administrative Areas (GADM, spatial ratings for spatial items [10]. Our new results address http://gadm.org). We mention that the metadata of tweets the problem of the fuzzy relation of users and hashtags with may contain not only GPS coordinates but also a place at- locations. tribute that can contain the name and type of the place. Since hashtag usage is highly volatile, the problem calls for However, we found the place attribute often ambiguous and an online method. Whenever a user sends a geotagged tweet less reliable. with a hashtag he or she has not used earlier, we consider the We use two models to recommend hashtags at a given time event as a trigger for recommendation. We measure the ac- and location, one based on the estimated probability of the curacy of our methods in the online evaluation framework of hashtag appearance based on its recency, and another based [12] based on discounted cumulative gain (DCG) computed on its temporal popularity. In both cases, our new method individually for each event and averaged over time. learns the importance of each node in the GADM tree. The final prediction arises as the weighted combination of the hashtag probabilities along the path of the GADM tree from the leaf location of the user up to the root. As baseline method, we use online matrix factorization [12]. Surprisingly, it turns out that matrix factorization per- forms much weaker than the distance based methods and contributes relatively little to the final prediction. This ob- servation justifies the importance of the temporal and geo- Copyright held by the author(s). graphic context of Twitter messages. LocalRec’15, September 19, 2015, Vienna, Austria. 1.1 Related work Table 1: Properties of the cleansed dataset. Most of previous publications on geographic recommender number of records 6,978,478 systems work with check-in data, where each of the items has number of unique user-hashtag pairs 2,993,183 a predefined static location. For brevity, we do not survey these here. Hashtag recommendations are addressed in two number of users 792,860 recent papers: Chen et al. [3] give methods for efficiently number of hashtags 268,489 maintaining a sliding window for time aware recommenda- number of countries 49 tion, and Diaz et al. [5] introduce methods to compute ma- trix factorization online. These results are orthogonal to our exploitation of the location information. Most of the hashtags in the database are quite rare, thus Spatial statistics of hashtag adoption are analyzed by Ka- we use only the hashtags that appear more than 5 times. math et al. [7]. Cheng et al. [4] give methods to geolocalize This way we exclude about 90% of the hashtags, but most of tweets based on content. Mocanu et al. [11] use a data set the hashtag timeline remains. We also exclude the hashtags similar to ours to analyze geographical properties like ho- that appear in the first month of the collection to recom- mogeneity and seasonal patterns of language usage at scales mend newly spreading hashtags for the users. The proper- ranging from country-level to city neighborhoods. Similar ties of the final cleansed dataset are summarized in Table 1. to our use of the Global Administrative Areas, regions-of- We collected all 214,230 nodes from the GADM database, interests partitioning is examined in [9] by applying k-means from which 190,315 are leaves. The depth of the tree is 6, clustering to establish natural regions over Twitter data. and includes 5 levels from the GADM tree plus continent- None of these papers exploit the results in recommender country relations. The hashtag time series data covered systems. 30,450 leaves from the tree. No other results use external data to define the hierarchy of locations for recommendation tasks. Similar to our result, in [6], GADM is used over the same Twitter data set, but 4. MODELING only for visualization purposes. 4.1 Recommendation by location hierarchy In our recommendation model we use the location ` at 2. ONLINE RECOMMENDATION AND time t of the user. Here ` notes the leaf of the tree that is EVALUATION closest to the current GPS location of the user. First, we We use the online recommendation framework described get the path in the tree from the root node to location `, in [12], in which model training and evaluation happen si- Path(`). Next, for a given hashtag, assume we have a rec- multaneously, iterating over the dataset only once, in chrono- ommendation method that yields scores s(h, n, t) for nodes logical order. Whenever we see a new tweet, we assume that n along Path(`). We will give two such methods in the next the user becomes active and reveals its location to the recom- subsections. In order to aggregate the individual recommen- mender system. Next, we recommend hashtags of potential dation at each GADM tree node, we propose the formula interest for the user. The recommendation is online, hence X it depends on the context at the exact time instance of the r̂(u, h, `, t) = wn · s(h, n, t), tweet. If a user u tweets with hashtag h at time t in loca- n∈Path(`) tion `, our models give a score r̂(u, h0 , `, t) for each hashtag where wn values are node specific weights. The weights wn h0 seen so far, and recommend to u the k hashtags with the are independent of the hashtags and characterize the area largest values from those that u has not used before. n only. We learn the weights by online gradient descent by The data is implicit: the events imply only that the user is optimizing for RMSE. If we consider all positive instances interested in a hashtag. In most of our models, we need nega- and generate negative ones as described in Section 2, we will tive instances as well for training. We use all hashtag usages have sufficiently many implicit data to update the weights as positive training instances and generate negative training online as we read the sequence of events. instances by selecting negRate random hashtags uniformly In our experiments we also investigate models where we at the time when a user first used a hashtag. We tested the set all wn values constant, i.e. we do not learn the weights, negRate parameter between 1 and 300. but simply sum all s(h, n, t) values along Path(`). We use the quality metric of [12] that we adopt to hash- tag recommendation. If h is the new hashtag in the message 4.2 Temporal popularity and the rank of h returned by the recommender system is Given a predefined time discretization that we test be- rank(h), then the discounted cumulative gain, DCG@k of tween a minute and a day, for each location in the tree, we this event is 1 if rank(h) ≤ k, and 0 oth- compute the number of occurrences of the hashtag in the log2 (rank(h) + 1) erwise. The overall evaluation of a model is the average time interval at the given location. As it follows power-law cumulative DCG@k. distribution, we use the logarithm of the temporal popular- ity values as node scores: s(h, n, t) = log(pop(h, n, t)), where pop(h, n, t) denotes the number of occurrences of hashtag h 3. TWITTER AND GEOGRAPHICAL DATA in node n in the time interval ending at time t. Dobos et al. [6] collected the dataset using the Twitter open API by requesting geotagged tweets. We used the data 4.3 Hashtag recency between February 1 and May 30, 2012 with February for Our next method estimates the chance of the appearance training and observing distributions only, hence the online of a hashtag by considering its most recent usage. The ad- learning period lasts three months. vantage of this method is that it is more sensitive to changes 1e+06 0.4 100,000 0.35 average cumulative DCG@100 N(IET = t) 10,000 1,000 0.3 100 10 0.25 1 0.1 0.2 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06 1e+07 0.15 t (sec) 0.1 Figure 1: Inter-event distribution. world tree 0.05 leaves tree with learned node weights countries 0 in trends. While it may more aggressively overfit to single 2 4 6 8 10 12 14 16 18 20 events, overall it performs similar to and combines very well time (days) with the popularity based method. In Fig. 1, we investigate the distribution of the time elapsed between the same hash- Figure 2: Average cumulative DCG@100 for the tree based tag appearing in tweets. This inter-event time distribution popularity model and its baseline variants. follows power law, in accordance with several earlier obser- vations [2, 14, 8], P (τ = t) = (α − 1) · t−α , whence we easily get  (1−α) 5.2 Popularity and recency based methods ∆t The temporal popularity based method using the GADM P (t < τ ≤ t + ∆t | τ > t) = 1 − 1 + . (1) t tree of Section 4.2 achieved best results by setting the time frame around 2 hours. In the recency based model of Sec- For location sensitive prediction we maintain the last ap- tion 4.3, the parameter ∆t had relatively little effect, we set pearance of each hashtag for every node in the geolocation ∆t = 12h. We compared the popularity and recency based tree. We compute the estimate of (1) in each node by using methods separately in Figs. 2 and 3, resp., by using differ- the global measured value α = 1.2. ent levels of the GADM tree and turning recency and node 4.4 Online matrix factorization weight learning on and off. In the figures, “world” denotes the methods that use global We apply stochastic gradient descent factorization for the values only and do not take user geolocation into account, user-hashtag matrix as in [12]. Batch stochastic gradient while “leaves” and “countries” use the corresponding level of descent iterates several times over the training set until con- the tree. Note that country level popularity and recency vergence. Online recommenders seem to be more restricted performed very well while leaves worked only with recency than those that may iterate over the data set several times. not popularity. Note that using countries but no temporal However, online matrix factorization proved to be superior information at all performs the poorest. to batch in [12], since it gives much more emphasis on recent Best performance is obtained when using the whole tree events. for recommendation by adding all recency values along the path corresponding to the current user location in the tree, 5. EXPERIMENTS marginally improving the country based results. However, In our graphs we show the average cumulative DCG for by applying the gradient method to learn node specific weights the first three weeks, by when all of our methods reach stable as in Section 4.1, we could achieve significantly better results performance. Here the cumulative average corresponds to for recency but not for popularity. By using the recency cumulative time average. We set k = 100 to compare our based tree learning algorithm, we were able to focus on the methods in detail. All methods show a slight performance active and representative part of the tree. We achieved our degradation, which is due to the fact that the number of best results with lRate=0.0001 and negRate=4. possible hashtags to recommend increase in time. 5.3 Online combination 5.1 Online matrix factorization In our final experiments we compared and combined our We used the online version of stochastic gradient descent strongest methods. In Figure 4 we plotted the average cu- (SGD) matrix factorization algorithm of [12]. We applied mulative DCG@100 as the function of time for our best mod- the mean square error with user and item regularization els. Surprisingly, the tree based methods strongly outper- terms of weight regRate as our objective function. Since form online matrix factorization, while the best popularity our data is implicit and contains only positive interactions, based model overtook the best recency based method a little we generated negative samples. Every time a user first posts bit in the long run. Next, we considered the strongest one, about a hashtag, we generate for her negRate hashtags that the tree based popularity without node weight learning, to she have not used in her past as described in Section 2. In improve it by combining it with the best factor model and all cases, we set negRate=99, learning rate lRate=0.4, and recency recommender. We used the SGD based double layer regRate=0.01. combination method introduced in [12] with mean squared As we show next, our tree based methods and their base- error as objective function. In Figure 5 we show the results line variants to exploit geographical information resulted in of the combination. The popularity model can be improved better performance in our experiments. by using the best recency method that uses the tree with 0.4 Table 2: Best performance methods and their combination, with relative improvement. 0.35 average cumlative DCG@100 0.3 DCG@100 DCG@10 factorization 0.206 0.180 0.25 recency w/ tree learning 0.355 0.323 0.2 popularity w/ tree learning 0.359 0.335 world popularity 0.374 0.342 0.15 leaves + recency (4.1% ) (2% ) countries popularity 0.381 0.35 0.1 countries without recency tree + recency + factor (6.1% ) (4.2% ) 0.05 tree with learned node weights 0 2 4 6 8 10 12 14 16 18 20 exploit the time and location context. Surprisingly, user personalization has little contribution to recommendation time (days) quality, hence our best methods apply in the user cold start setting as well. Figure 3: Average cumulative DCG@100 curves of the re- cency based methods. 7. REFERENCES 0.4 [1] J. Bao, Y. Zheng, D. Wilkie, and M. F. Mokbel. A survey average cumulative 0.35 on recommendations in location-based social networks. 0.3 ACM Trans. Intelligent Systems and Technology, 2013. DCG@100 0.25 [2] A.-L. Barabási. The origin of bursts and heavy tails in 0.2 factor human dynamics. Nature, 435(7039):207–211, 2005. 0.15 pop [3] C. Chen, H. Yin, J. Yao, and B. Cui. Terec: A temporal 0.1 recency recommender system over tweet stream. Proceedings of the 0.05 VLDB Endowment, 6(12):1254–1257, 2013. 2 4 6 8 10 12 14 16 18 20 [4] Z. Cheng, J. Caverlee, and K. Lee. You are where you time (days) tweet: a content-based approach to geo-locating twitter users. In Proc. CIKM, pp. 759–768. ACM, 2010. Figure 4: Best performances achieved with the factor, the [5] E. Diaz-Aviles, L. Drumond, L. Schmidt-Thieme, and popularity, and the recency based recommenders. W. Nejdl. Real-time top-n recommendation in social streams. In Proc. RecSys, pp. 59–66. ACM, 2012. [6] L. Dobos, J. Szüle, T. Bodnár, T. Hanyecz, T. Sebők, 0.4 D. Kondor, Z. Kallus, J. Stéger, I. Csabai, and G. Vattay. A multi-terabyte relational database for geo-tagged social 0.38 average cumulative DCG@100 network data. In Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference 0.36 on, pp. 289–294. IEEE, 2013. 0.34 [7] K. Y. Kamath, J. Caverlee, K. Lee, and Z. Cheng. Spatio-temporal dynamics of online memes: a study of 0.32 geo-tagged tweets. In Proc. WWW, pp. 667–678, 2013. [8] M. Kivelä and M. A. Porter. Estimating inter-event time 0.3 distributions from finite observation periods in communication networks. arXiv preprint arXiv:1412.8388, 0.28 popularity 2014. popularity + recency [9] R. Lee and K. Sumiya. Measuring geographical regularities 0.26 popularity + recency + factor of crowd behaviors for twitter-based geo-social event 0.24 detection. In Proceedings of the 2nd ACM SIGSPATIAL international workshop on location based social networks, 2 4 6 8 10 12 14 16 18 20 pp. 1–10. ACM, 2010. time (days) [10] J. J. Levandoski, M. Sarwat, A. Eldawy, and M. F. Mokbel. Lars: A location-aware recommender system. In Proc. Figure 5: Combination of the best three different models. ICDE, pp. 450–461. IEEE, 2012. [11] D. Mocanu, A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and A. Vespignani. The twitter of babel: Mapping world languages through microblogging platforms. learned node weights. We could further improve our results PloS One, 8(4):e61981, 2013. by including the factor model in our recommendation. In [12] R. Pálovics and A. A. Benczúr. Temporal influence over the Table 2 we collected the overall performance of our best last.fm social network. Social Netw. Analys. Mining, 5(1):4, methods and their combinations. 2015. [13] P. Symeonidis, D. Ntempos, and Y. Manolopoulos. Location-based social networks. In Recommender Systems 6. CONCLUSION AND FUTURE WORK for Location-based Social Networks, pp. 35–48. Springer, We gave online, location based learning methods for Twit- 2014. ter hashtag recommendation. Since hashtags are not directly [14] A. Vazquez, B. Rácz, A. Lukács, and A.-L. Barabási. bound to a location, may be geographically spread, and vary Impact of non-poissonian activity patterns on spreading in popularity at different times, we designed methods that processes. Physical review letters, 98(15):158702, 2007.