-

Matrix Factorization for Near Real-time Geolocation Prediction in Twitter Stream

Nghia Duong-Trung

Nicolas Schilling

schilling@ismll.uni-hildesheim.de 0

Lucas Rego Drumond

ldrumond@ismll.uni-hildesheim.de 0

Lars Schmidt-Thieme

schmidt-thieme@ismll.uni-hildesheim.de 0 0 Information Systems and Machine Learning Lab (ISMLL) Universitatsplatz 1 , 31141 Hildesheim , Germany

The geographical location is vital to geospatial applications such as event detection, geo-aware recommendation and local search. Previous research on this topic has investigated geolocation prediction framework via conducting pre-partitioning and applying classi cation methods. These existing approaches target user's geolocation all at once via concatenation of tweets. In this paper, we study a novel problem in geolocation. We aim to predict user's geolocation at a given tweet's posting time. We propose a geo matrix factorization model to address this problem. First, we map tweets into a latent space using a matrix factorization technique. Second, we use a linear combination in the latent space to predict exact latitude and longitude. However, we only use one individual tweet as the input instead of using a concatenation of all tweets of a user. Our experimental results show that the proposed model has outperformed a set of regression models and state-of-the-art classi cation approaches.

Twitter Near real-time Geolocation Matrix Factorization

In the past years, online social networking and social media sites, e.g. Twitter in general, have become an ubiquitous and constant mechanism for sharing and seeking information. Although a tweet's length is limited to 140 characters, there is still a huge amount of information to explore. Its contents are inherently multifaceted and dynamic; consequently, representing people's thoughts and public announcement at a temporal currency and vicinity. This causes Twitter data to become speci cally interesting for multi-purpose investigations as they are tweeted in near real-time fashion. Understanding the near real-time user's geographical location, e.g. latitude and longitude pairs or physical coordinates, enables providing policies and intervention aid strategies in a particular region such as localized aid [ 31,9 ], disaster response [ 27,21 ], event detection [ 28 ] and disease surveillance [ 3 ].

One of the early pioneer papers about geolocation in Twitter streams was published in 2010 [ 12 ]. In that work, the authors concatenated all user's tweets during a speci ed duration into one single representative document. The geolocation of the rst tweet or the rst available geo-tagged tweet in the collection was then the geolocation of the representative document. Using a concatenation provides circumstantial contents to develop a wide variety of techniques used in geo-locating such as content analysis with terms in a gazetteer [ 19 ], content analysis with probabilistic language models [ 11,16,1 ], metadata of various sorts such as follow-following relationships [ 17,22 ], behavior-based time zone [ 20 ]. Furthermore, the research conducted in [ 17 ] exploits the idea of geolocation prediction as label propagation by interpreting location labels spatially. Additionally, the work of [ 6 ] extends [ 17 ] by taking into account edge weights as a function re ecting user interactions.

Prerequisites to these directions are the representation of the earth's surface. Geolocations can be captured as points, or clusters based on a pre-partitioning of regions into discrete sub-regions using city locations [ 5,18,26 ], named entities and location indicative words [ 14 ] as well as vernacular expressions with the aid of comprehensive gazetteers [ 15 ]. Another approach of partitioning the earth's surface is to use a grid. While the simplest grid is a uniform rectangular one with cells of equal-sized degrees [ 30 ], more advanced grids are either an adaptive grid based on k -d trees [ 25 ], an equal-area quaternary triangular mesh [ 8 ] or a hierarchical structure [ 29 ].

However, these approaches have some drawbacks due to some reasons. First of all, as being classi cation methods, they heavily depend on pre-partitioning or a framing architecture that is used to split the regions into discrete sub-regions. Thus, they discard the natural properties of real physical coordinates. Moreover, concatenating tweets into one representative document requires a timeconsuming collection as well as data abundance. In addition, concatenation of tweets during a particular duration, e.g. a month, leads to failure of capturing geolocation in near real-time situations. E ective geolocation of a user while posting a single short tweet based purely on its content is a direction worthinvestigating and also constitutes a more di cult task.

In this paper, we address a novel geolocation prediction scenario via regression within indicative latent feature space. By working on the latent feature space, we have proved that regression models can be utilized to solve this prediction problem. We aim to predict the exact user geolocation at a given posting's time, simply based on the textual content of tweets, ignoring their metadata. 2

Proposed Method

In this section, we present the general notation used in this paper as well as our approach. It is based on a matrix factorization of the individual tweets where we then learn a latent representation of tweets and words. This latent representation will then be used to predict the nal geolocation. We also present a learning algorithm for our approach which is optimized by stochastic gradient descent. 2.1

Notation

Consider a dataset D containing a set of tweets where each tweet is described by n many features. The dataset will be split into a training Dtrain, a test Dtest and a validation Dvalid set, which will be used for hyperparameter optimization later. We have m, l and v tweets in the training Dtrain, test Dtest and validation Dvalid sets respectively. The tweet features are mapped from a dictionary that comprises all words/tokens/unigrams in the dataset. We denote the vocabulary size by jV j = n.

Each tweet is annotated with a ground-truth coordinate pair y 2 R2, y = (ylat; ylon) where ylat 2 R is the latitude and ylon 2 R is the longitude of the associated tweet. By yui = (yulait; ylon) we denote the average geolocation of a ui user in the training set, where yulait 2 R is the average latitude and yuloin 2 R is the average longitude. Using yU = (yUlat; ylon), we denote the average geolocation U of all users in the training set. Given some training data Xtrain 2 Rm n, and the respective labels Y train 2 Rm 2, we seek to learn a machine learning model f : Rn ! R2 which maps tweets to geolocations such that for some test data Xtest 2 Rv n, the sum of distances

v X d(f (Xitest); Yitest) i=1 (1) is minimal. By Y test 2 Rv 2 we denote the set of ground-truth labels for the test data. Note that, d is a distance metric where in our learning algorithm we use the Haversine distance. 2.2

The Geo Matrix Factorization Model

Over the last decade, Matrix Factorization (MF) models have gained much attention by the Net ix Prize competition where they have shown very good predictive performance as well as decent run-time complexity in terms of dealing with very sparse matrices. Based on the vanilla MF, we develop a more multi-relationaloriented factorization model for the geolocation regression task: the Geo Matrix Factorization (GMF) model. We approach the user geolocation problem as a text regression task where we aim to predict the exact latitude and longitude values using an individual tweet. However, instead of using the highly sparse word counts as features in a linear regression, we rstly factorize the input space by learning a matrix T 2 Rm k for tweets and W 2 Rk n for individual words of each tweet to reconstruct X as:

T W (2)

As in the usual setting, the number of latent features k is usually much smaller than the number of words n, such that through this approach, tweets are projected into a lower dimensional latent feature space. This latent representation of a tweet is then used within a linear model to predict the geolocation of the user at the posting time of the tweet:

where 2 Rk+1 and 2 Rk+1 are weight coe cients vectors for learning latitude and longitude respectively. Notice that we also actually perform two factorizations of X, one for latitude which yields T lat, this is done for longitude as well. Our model then actually predicts the average training location of a user, plus a regression term on the latent feature space obtained by the factorization of X. 2.3

Model Fitting

Given the model, we have to learn parameters T lat; T lon; W lat; W lon; ; , where the W matrices are only used for reconstructing X and not for predicting the actual geolocation. We optimize the prediction of the gelocation as well as the factorization of X for the least-squares error. In order to prevent the model from over tting to the training data we apply a Tikhonov regularization on the regression parameters and , the latent feature matrices are regularized using the Frobenius norm. The overall loss term for learning the parameters associated to predicting latitude then looks like

K y^lat = yulalt + 0 + X l

kTllkat k=1

K y^llon = yuloln + 0 + X kTllkon (3) (4) Llat(y^lat; ylat) =

1 j Xtrain j y^lat ylat 2 + k k 2 +

Xtrain

T latW lat 2F + T T lat 2F +

W lat 2F ; The regularization terms by regularization parameters larization.

These terms penalize parameters with high magnitudes, that typically lead to overly complex models with very small training errors but bad generalization performance. Certainly, these hyperparameters can not be learned from the data and will be optimized using a grid-search on the validation partition of the data. To solve the above optimization tasks, we apply Stochastic Gradient Descent (SGD) [ 2,13 ] where the learning rate is estimated using the Adaptive Subgradient Method (AdaGRAD) [ 10 ] which helps yielding a better run-time performance. The basic idea of SGD is that, instead of expensively calculating the gradient of Equation 4 and its latitude counterpart, it randomly selects a tweet and calculates the corresponding gradient. Suppose we have chosen a tweet indexed by m, the partial derivatives of Equation 4 with the respect to T lat can be computed as:

Xmn

K X T mlaktWklant k=1 ymlat yulamt kT mlakt 0

l K X k=1

Wllnat + T T mlalt

The partial derivatives with respect to the latent feature matrix W lat of the tokens is obtained by

Xmj

K X T mlaktWklajt T mlalt + k=1

W Wlljat

Finally, the partial derivative of the regression parameters has the form: as well as the respective term for longitude. (5) (6) (7) (8) = = ymlat ymlat yulamt yulamt

K X k=1 K X k=1 kT mlakt kT mlakt 0 T mlajt +

j 0 and

The partial derivatives of the longitude loss with the respect to T lon, W lon can be calculated in the exact same manner as Equations 5, 6 and 7. 2.4

Inference for Test Data

By optimizing the respective loss terms for the training data, we learn the latent representation T of all training tweets as well as the linear regression parameters and for predicting the nal geolocation. However, as we want to predict geolocations of unseen test tweets, the latent representations T for the individual training tweets cannot be employed. Out of this reason, we perform a fold-in, where we factorize the feature matrix Xtest of the test data, using the latent representation W of the word tokens that was learned on the training data. To avoid confusion, we denote the latent tweet representations for the test tweets by T 0lat and T 0lon and factorize Xtest as

As we can see, W lat and W lon are reused from the learning phase. Subsequently, in the fold-in phase, we de ne the objective function that we need to minimize for T 0lat as follows:

L lat Xtest; T 0latW lat =

1 jXtestj

The partial derivatives of Equation 9 with the respect to T 0lat can be computed by: =

Xjtnest

K X Tj0klatWklant k=1

Wklant + testT 0ljakt

The partial derivatives with the respect to T 0lon can be also computed in the same manner as for Equation 10. Having learned the latent representation of the test tweets using the fold-in procedure, we can then perform predictions for the test users using Equation 3. However, not all users that appear in the test data necessarily have to appear in the training data, hence we cannot use their average geolocation for the nal prediction. For those users, we then use the median geolocation of all users of the training data as: yul = (yul ; if ul 2 Dtrain yU ;

otherwise

Algorithm 1 illustrates how the overall GMF works. 3

Experiments

In this section, we rst describe the datasets that we use as well as their preprocessing. Additionally, we describe how we optimized the hyperparameters of our model. Finally, we compare our approach to a set of competing methods. 3.1

Dataset

We have worked with three publicly available tweet datasets containing geolocation information and compiled them to t the user geolocation prediction within the near real-time scenario. One dataset comprises the tweets posted within the United States, whereas the other dataset contains all tweets localized to north America and the world. Through this, we evaluate our model's e ectiveness and generality within di erent geographical scopes from a country to the whole world. A splitting protocol is then designed for these datasets. We randomly (9) (10) (11)

Algorithm 1 GMF

Require: Xtrain 2 Rm n, Xtest 2 Rl nk, Yn 2 Rm 2 Ensure: T 2 Rm k, T 0 2 Rl k, W 2 R , 2 Rk+1, split all tweets of each user by a 60/20/20 scheme, denoted as LocalRandom (LR). Secondly, we also investigate how our model works with a user appearing in the test set might not exist in the training data by splitting all tweets using the 60/20/20 scheme, called GlobalRandom (GR).

US. This dataset is originally implemented by [ 12 ], and was later also used in [ 11,30,16 ]. The dataset comprises tweets gathered from the "Gardenhose" sample stream in the rst week of March, 2010. In this dataset, the authors already provide geotagged tweets that we simply reuse. The implementing dataset contains 377,616 tweets posted by 9,475 users.

NA. The second dataset was collected by [ 25 ] and later implemented by [ 29,15 ]. This dataset contains tweets within north America, including the United States, parts of Canada and Mexico from September 4th to November 29th, 2011. Because Twitter does not allow the distribution of complete tweets at that time, the NA dataset only contains user IDs and tweet IDs. Subsequently, we have to fetch the tweets from Twitter using its o cial API to check whether the tweets are available as well as their availability of embedded coordinates. Only 226,595 tweets out of 38 million posted by 10,950 users have geotags available and therefore are considered for the nal dataset.

WORLD. The last dataset was compiled by [ 14 ] and later implemented by [ 29,15 ]. The dataset comprises tweets from all over the world. As being described in NA dataset, we also apply the same retrieving procedure. The implementing dataset then contains 121,327 tweets posted by 80,179 users. In the WORLD dataset, 70% of users has only one tweet. So that we only apply the GR 60/20/20 splitting scheme to it. 3.2

Data Preprocessing

In addition to length restriction, tweets are also characterized by the use of terms that are not found in natural language, including hashtags, abbreviations, emoticons and URLs. Through this, we propose a data preprocessing procedure as follows.

Tokenization. We apply a uni-gram tokenization procedure that preserves hashtags, @-replies, abbreviations, blocks of punctuation, emoticons and unicode glyphs and other symbols as tokens. We remove URL tokens to prevent the tweets where bots are posting information such as advertisement to enter our dataset.

Bag-of-words representation. After all tweets are tokenized, they are converted from sparse vectors of token counts into sparse vectors of bag-of-words representations using term frequency - inverse document frequency (TF.IDF) scores. By using the TF.IDF scores, we discard language and grammar structure, the token's order, semantics and meaning as well as part-of-speech. The TF.IDF weights re ect how important a token is to an instance. The more common a token is to many instances, the more penalization it gets. The tokens with the highest TF.IDF weight are often the tokens that best characterize the instance. 3.3

Evaluation Metrics

Given the ellipsoidal shape of the earth's surface, we apply the Haversine distance to calculate the distance of two points represented by their latitude in range of f 90; 90g and longitude in range of f 180; 180g. The Haversine distance dH : R2 R2 ! R is the great circle distance between two geographical coordinate pairs. We compute the distance between two points by the Haversine formula [ 24 ]. The formula of the central angle between them is given by: = sin2 j y^lat

ylatj + cos(ylat) cos(y^lat) sin2 j y^lon

ylonj where r is the radius of the earth. Because of the ellipsoidal shape of the earth, its radius varies from the equator to the poles. According to [ 7 ], we take the mean of the earth's radius which amounts to r = 6371 km. Finally, the evaluation metrics are the mean and median Haversine distances dH in kilometers between the ground-truth geolocation y and the predicted geolocation y^. In order to obtain good predictive performance, we also need to carefully tune the hyperparameters in our model. By k 2 N+ we denote the number of latent features used within the factorization of X. By T ; W ; ; and T 0 we denote the regularization hyperparameters used when learning the latent feature matrices, latent vocabulary matrices, the linear regression parameters for predicting latitude and longitude and the latent features matrices for the test tweets respectively. With T ; W ; ; and T 0 we denote the respective learning rates. We tune the hyperparameters by assessing the validation performance of our model and choosing the hyperparameter con guration which performs best. The number of latent dimensions is selected among the range of k 2 f2; 4; 8; 16g, while the value of all other hyperparameters are selected among the range of f0:1; 0:01; 0:001; 0:0001; 0:00001g. The preprocessed datasets used in the paper are publicly available unconditionally1. For the Support Vector Machine (SVM) and Factorization Machines (FM), we run them separately to predict latitude and longitude. To allow for a fair comparison, all these regression models also include the user bias in their estimation. Finally, we combine the predicted latitude and longitude to conduct a nal distance calculation. For these models, we also apply a grid-search mechanism to nd the best hyperparameter con gurations for each prediction of latitude and longitude. On each dataset, we repeat running the models 10 times and take the average results. The nal results can be observed in Table 1. We can see that all other regression models on average do not perform that well, mainly due to them using the extremely sparse 5; 200 TF.IDF features. Our model, however, maps each tweet individually into an eight-dimensional latent feature space and uses those features for prediction. The number of k latent feature is found by grid-search mechanism. The results show that GMF outperforms all competitors with large margins.

We also report the state-of-the-art results by classi cation approaches (see Table 2). One might notice that there are signi cant di erences in term of accuracy prediction in two geolocation prediction scenarios. By targeting user's geolocation at a given posting's time, the results show that our model signi cantly 1 Available online at: http://fs.ismll.de/publicspace/GMF/ reduces the localization error on the US and NA datasets. For the WORLD dataset, the average individual tweet's length is 5 tokens while being 49 tokens for the concatenation of tweets, our model still achieves reasonable results. We have investigated the geo matrix factorization model for the task of near real-time text-based geolocation in Twitter. In our work, we tackle the user geolocation prediction task in a regression perspective. We analyze a single tweet as the model's input without any concatenation. Through this, we can further predict the user trajectory and achieve geolocation at a given posting's time. This is a starting point for further investigation on the a ection of tweet concatenation or the number of tweets needed to achieve an acceptable distance error. Furthermore, We also address the sparsity and imbalance of online conversational texts by a matrix factorization technique. Based on the experiment results, our model outperforms all the competitors including SVM and FM within the regression task using dedicated latent feature spaces. In comparison with current state-of-the-art results by classi cation approaches, our model still outperforms and/or achieve reasonable results. Our further improvement broadly falls into various directions: optimization or applying the model over di erent datasets. In the optimization direction, we will analyze direct optimization of the Haversine formula. We also expand our model to predict near real-time geolocation of another types of datasets such as Wikipedia articles and Flickr images. Acknowledgments. Nghia Duong-Trung gratefully acknowledges the funding of his work by the Ministry of Education and Training of Vietnam under the national project no. 911.

1. Ahmed , A. , Hong , L. , Smola , A.J. : Hierarchical geographical modeling of user locations from social media posts . In: Proceedings of the 22nd international conference on World Wide Web . pp. 25 { 36 . International World Wide Web Conferences Steering Committee ( 2013 )

2. Bottou , L. : Large-scale machine learning with stochastic gradient descent . In: Proceedings of COMPSTAT'2010 , pp. 177 { 186 . Springer ( 2010 )

3. Burton , S.H. , Tanner , K.W. , Giraud-Carrier , C.G. , West , J.H. , Barnes , M.D.: " right time, right place" health communication on twitter: value and accuracy of location information . Journal of medical Internet research 14(6) ( 2012 )

4 . Chang , C.C. , Lin , C.J.: LIBSVM: A library for support vector machines . ACM Transactions on Intelligent Systems and Technology 2 , 27 :1{ 27 : 27 ( 2011 )

5. Cheng, Z. , Caverlee , J. , Lee , K. : You are where you tweet: a content-based approach to geo-locating twitter users . In: Proceedings of the 19th ACM international conference on Information and knowledge management . pp. 759 { 768 . ACM ( 2010 )

6. Compton , R. , Jurgens , D. , Allen , D. : Geotagging one hundred million twitter accounts with total variation minimization . In: Big Data (Big Data) , 2014 IEEE International Conference on . pp. 393 { 401 . IEEE ( 2014 )

7. Decker , B.L. : World geodetic system 1984 . Tech. rep., DTIC Document ( 1986 )

8. Dias , D. , Anastacio , I. , Martins , B. : A language modeling approach for georeferencing textual documents . In: Actas del Congreso Espan~ol de Recuperacion de Informacion ( 2012 )

9. Dredze , M. : How social media will change public health . Intelligent Systems, IEEE 27 ( 4 ), 81 { 84 ( 2012 )

10. Duchi , J. , Hazan , E. , Singer , Y. : Adaptive subgradient methods for online learning and stochastic optimization . The Journal of Machine Learning Research 12 , 2121 { 2159 ( 2011 )

11. Eisenstein , J. , Ahmed , A. , Xing , E.P. : Sparse additive generative models of text . In: Proceedings of the 28th International Conference on Machine Learning (ICML-11) . pp. 1041 { 1048 ( 2011 )

12. Eisenstein , J., O 'Connor , B. , Smith , N.A. , Xing , E.P.: A latent variable model for geographic lexical variation . In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing . pp. 1277 { 1287 . Association for Computational Linguistics ( 2010 )

13. Gemulla , R. , Nijkamp , E. , Haas , P.J. , Sismanis , Y. : Large-scale matrix factorization with distributed stochastic gradient descent . In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining . pp. 69 { 77 . ACM ( 2011 )

14. Han , B. , Cook , P. , Baldwin , T. : Geolocation prediction in social media data by nding location indicative words . Proceedings of COLING 2012 : Technical Papers pp. 1045 { 1062 ( 2012 )

15. Han , B. , Cook , P. , Baldwin , T. : Text-based twitter user geolocation prediction . Journal of Arti cial Intelligence Research pp. 451 { 500 ( 2014 )

16. Hong , L. , Ahmed , A. , Gurumurthy , S. , Smola , A.J. , Tsioutsiouliklis , K. : Discovering geographical topics in the twitter stream . In: Proceedings of the 21st international conference on World Wide Web . pp. 769 { 778 . ACM ( 2012 )

17. Jurgens , D. : That's what friends are for: Inferring location in online social media platforms based on social relationships . In: ICWSM ( 2013 )

18. Kinsella , S. , Murdock , V. ,

'Hare , N.: I'm eating a sandwich in glasgow: modeling locations with tweets . In: Proceedings of the 3rd international workshop on Search and mining user-generated contents . pp. 61 { 68 . ACM ( 2011 )

19. Li , R. , Wang , S. , Deng , H. , Wang , R. , Chang , K.C.C. : Towards social user proling: uni ed and discriminative in uence model for inferring home locations . In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining . pp. 1023 { 1031 . ACM ( 2012 )

20. Mahmud , J. , Nichols , J. , Drews , C. : Home location identi cation of twitter users . ACM Transactions on Intelligent Systems and Technology (TIST) 5 ( 3 ), 47 ( 2014 )

21. McClendon , S. , Robinson , A.C. : Leveraging geospatially-oriented social media communications in disaster response . International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 5 ( 1 ), 22 { 40 ( 2013 )

22. McGee , J. , Caverlee , J., Cheng, Z. : Location prediction in social media based on tie strength . In: Proceedings of the 22nd ACM international conference on Conference on information & knowledge management . pp. 459 { 468 . ACM ( 2013 )

23. Rendle , S. : Factorization machines with libfm . ACM Transactions on Intelligent Systems and Technology (TIST) 3 ( 3 ), 57 ( 2012 )

24. Robusto , C. : The cosine-haversine formula . American Mathematical Monthly pp. 38 { 40 ( 1957 )

25. Roller , S. , Speriosu , M. , Rallapalli , S. , Wing , B. , Baldridge , J.: Supervised textbased geolocation using language models on an adaptive grid . In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning . pp. 1500 { 1510 . Association for Computational Linguistics ( 2012 )

26. Rout , D. , Bontcheva , K. , Preotiuc-Pietro , D. , Cohn , T. : Where's@ wally?: a classi cation approach to geolocating users based on their social ties . In: Proceedings of the 24th ACM Conference on Hypertext and Social Media . pp. 11 { 20 . ACM ( 2013 )

27. Sakaki , T. , Okazaki , M. , Matsuo , Y. : Earthquake shakes twitter users: real-time event detection by social sensors . In: Proceedings of the 19th international conference on World wide web . pp. 851 { 860 . ACM ( 2010 )

28. Weng , J. , Lee , B.S.: Event detection in twitter . ICWSM 11 , 401 { 408 ( 2011 )

29. Wing , B. , Baldridge , J.: Hierarchical discriminative classi cation for text-based geolocation . In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing . pp. 336 { 348 ( 2014 )

30. Wing , B.P. , Baldridge , J.: Simple supervised document geolocation with geodesic grids . In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 . pp. 955 { 964 . Association for Computational Linguistics ( 2011 )

31. Yin , J. , Lampert , A. , Cameron , M. , Robinson , B. , Power , R. : Using social media to enhance emergency situation awareness . IEEE Intelligent Systems (6) , 52 { 59 ( 2012 )