Matrix Factorization for Near Real-time Geolocation Prediction in Twitter Stream

Nghia Duong-Trung, Nicolas Schilling, Lucas Rego Drumond, and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)
Universitätsplatz 1, 31141 Hildesheim, Germany
{duongn,schilling,ldrumond,schmidt-thieme}@ismll.uni-hildesheim.de
http://www.ismll.uni-hildesheim.de

Abstract. The geographical location is vital to geospatial applications such as event detection, geo-aware recommendation and local search. Previous research on this topic has investigated geolocation prediction frameworks that pre-partition the earth's surface and apply classification methods. These existing approaches target a user's geolocation all at once via a concatenation of the user's tweets. In this paper, we study a novel problem in geolocation: we aim to predict a user's geolocation at a given tweet's posting time. We propose a geo matrix factorization model to address this problem. First, we map tweets into a latent space using a matrix factorization technique. Second, we use a linear combination in the latent space to predict the exact latitude and longitude. Crucially, we use only an individual tweet as input instead of a concatenation of all tweets of a user. Our experimental results show that the proposed model outperforms a set of regression models and state-of-the-art classification approaches.

Keywords: Twitter, Near real-time Geolocation, Matrix Factorization

1 Introduction

In the past years, online social networking and social media sites, e.g. Twitter, have become a ubiquitous and constant mechanism for sharing and seeking information. Although a tweet's length is limited to 140 characters, there is still a huge amount of information to explore. Tweet contents are inherently multifaceted and dynamic, representing people's thoughts and public announcements with high temporal and spatial currency. This makes Twitter data particularly interesting for multi-purpose investigations, as tweets are posted in near real-time fashion. Knowing a user's geographical location in near real-time, e.g. as latitude and longitude pairs or physical coordinates, enables policies and intervention aid strategies in a particular region such as localized aid [31,9], disaster response [27,21], event detection [28] and disease surveillance [3].

One of the pioneering papers on geolocation in Twitter streams was published in 2010 [12]. In that work, the authors concatenated all of a user's tweets during a specified duration into one single representative document. The geolocation of the first tweet, or of the first available geo-tagged tweet in the collection, was then taken as the geolocation of the representative document. Such a concatenation provides rich context for a wide variety of geolocation techniques, such as content analysis with terms in a gazetteer [19], content analysis with probabilistic language models [11,16,1], and metadata of various sorts such as follower-following relationships [17,22] and behavior-based time zones [20]. Furthermore, the research conducted in [17] exploits the idea of geolocation prediction as label propagation by interpreting location labels spatially, and the work of [6] extends [17] by taking edge weights into account as a function reflecting user interactions. A prerequisite for these directions is a representation of the earth's surface.
Geolocations can be captured as points, or as clusters based on a pre-partitioning of regions into discrete sub-regions using city locations [5,18,26], named entities and location indicative words [14], as well as vernacular expressions with the aid of comprehensive gazetteers [15]. Another approach to partitioning the earth's surface is to use a grid. While the simplest grid is a uniform rectangular one with cells of equal-sized degrees [30], more advanced grids are either an adaptive grid based on k-d trees [25], an equal-area quaternary triangular mesh [8] or a hierarchical structure [29].

However, these approaches have several drawbacks. First, being classification methods, they depend heavily on a pre-partitioning or framing architecture that splits regions into discrete sub-regions, and thus discard the natural properties of real physical coordinates. Moreover, concatenating tweets into one representative document requires a time-consuming collection phase as well as abundant data. In addition, concatenating tweets over a particular duration, e.g. a month, fails to capture geolocation in near real-time situations. Effectively geolocating a user at the moment a single short tweet is posted, based purely on its content, is a direction worth investigating and constitutes a more difficult task.

In this paper, we address this novel geolocation prediction scenario via regression within an indicative latent feature space. By working in the latent feature space, we show that regression models can be utilized to solve this prediction problem. We aim to predict the exact user geolocation at a given tweet's posting time, based solely on the textual content of tweets and ignoring their metadata.

2 Proposed Method

In this section, we present the general notation used in this paper as well as our approach. It is based on a matrix factorization of the individual tweets, through which we learn a latent representation of tweets and words. This latent representation is then used to predict the final geolocation. We also present a learning algorithm for our approach, which is optimized by stochastic gradient descent.

2.1 Notation

Consider a dataset D containing a set of tweets, each described by n features. The dataset is split into a training set D^train, a test set D^test and a validation set D^valid, the latter used for hyperparameter optimization. We have m, l and v tweets in D^train, D^test and D^valid, respectively. The tweet features are mapped from a dictionary that comprises all words/tokens/unigrams in the dataset; we denote the vocabulary size by |V| = n.

Each tweet is annotated with a ground-truth coordinate pair y = (y^lat, y^lon) ∈ R^2, where y^lat ∈ R is the latitude and y^lon ∈ R is the longitude of the associated tweet. By ȳ_{u_i} = (ȳ^lat_{u_i}, ȳ^lon_{u_i}) we denote the average geolocation of a user in the training set, where ȳ^lat_{u_i} ∈ R is the average latitude and ȳ^lon_{u_i} ∈ R is the average longitude. By ȳ_U = (ȳ^lat_U, ȳ^lon_U) we denote the average geolocation of all users in the training set.

Given some training data X^train ∈ R^{m×n} and the respective labels Y^train ∈ R^{m×2}, we seek to learn a machine learning model f : R^n → R^2 which maps tweets to geolocations such that, for some test data X^test ∈ R^{l×n} with ground-truth labels Y^test ∈ R^{l×2}, the sum of distances

    Σ_{i=1}^{l} d(f(X^test_i), Y^test_i)    (1)

is minimal. Note that d is a distance metric; in our learning algorithm we use the Haversine distance.
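To make the notation concrete, the following minimal Python sketch (our illustration, not part of the original model description; function and variable names are hypothetical) computes the per-user average geolocations ȳ_{u_i} and the global average ȳ_U from the training labels:

    import numpy as np

    def user_average_locations(user_ids, Y_train):
        # Group the training labels (lat, lon) by posting user.
        per_user = {}
        for u, y in zip(user_ids, Y_train):
            per_user.setdefault(u, []).append(y)
        # Per-user average geolocation \bar{y}_{u_i}.
        y_bar = {u: np.mean(np.stack(ys), axis=0) for u, ys in per_user.items()}
        # Average geolocation \bar{y}_U over all users of the training set.
        y_bar_U = np.mean(np.stack(list(y_bar.values())), axis=0)
        return y_bar, y_bar_U

These per-user averages serve as the user bias of the model below, with ȳ_U as a fallback for unseen users.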
2.2 The Geo Matrix Factorization Model

Over the last decade, Matrix Factorization (MF) models have gained much attention through the Netflix Prize competition, where they have shown very good predictive performance as well as decent run-time complexity when dealing with very sparse matrices. Building on vanilla MF, we develop a more multi-relational-oriented factorization model for the geolocation regression task: the Geo Matrix Factorization (GMF) model.

We approach the user geolocation problem as a text regression task where we aim to predict the exact latitude and longitude values from an individual tweet. However, instead of using the highly sparse word counts as features in a linear regression, we first factorize the input space by learning a matrix T ∈ R^{m×k} for tweets and a matrix W ∈ R^{k×n} for the individual words, reconstructing X as:

    X ≈ T W    (2)

The number of latent features k is typically much smaller than the number of words n, so through this approach tweets are projected into a lower-dimensional latent feature space. This latent representation of a tweet is then used within a linear model to predict the geolocation of the user at the posting time of the tweet:

    ŷ^lat_l = ȳ^lat_{u_l} + φ_0 + Σ_{k=1}^{K} φ_k T^lat_{lk}
    ŷ^lon_l = ȳ^lon_{u_l} + θ_0 + Σ_{k=1}^{K} θ_k T^lon_{lk}    (3)

where φ ∈ R^{K+1} and θ ∈ R^{K+1} are weight coefficient vectors for latitude and longitude, respectively. Notice that we perform two separate factorizations of X: one for latitude, yielding T^lat and W^lat, and one for longitude, yielding T^lon and W^lon. Our model thus predicts the average training location of a user plus a regression term on the latent feature space obtained by the factorization of X.

2.3 Model Fitting

Given the model, we have to learn the parameters T^lat, T^lon, W^lat, W^lon, θ and φ, where the W matrices are only used for reconstructing X and not for predicting the actual geolocation. We optimize both the prediction of the geolocation and the factorization of X for the least-squares error. In order to prevent the model from overfitting the training data, we apply Tikhonov regularization to the regression parameters θ and φ, while the latent feature matrices are regularized using the Frobenius norm. The overall loss for learning the parameters associated with predicting latitude then reads:

    L^lat(ŷ^lat, y^lat) = (1 / |X^train|) ||ŷ^lat − y^lat||^2 + λ_φ ||φ||^2
                        + ||X^train − T^lat W^lat||_F^2 + λ_T ||T^lat||_F^2 + λ_W ||W^lat||_F^2    (4)

The loss associated with longitude, L^lon(ŷ^lon, y^lon), is analogous; the only difference is that it involves θ, T^lon and W^lon. In Equation 4, the term ||X^train − T^lat W^lat||_F^2 is the residual error of factorizing X into T^lat and W^lat. The regularization terms λ_φ ||φ||^2, λ_T ||T^lat||_F^2 and λ_W ||W^lat||_F^2 are weighted by the regularization parameters λ_φ, λ_T and λ_W, which control the amount of regularization. These terms penalize parameters of high magnitude, which typically lead to overly complex models with very small training errors but bad generalization performance. These hyperparameters cannot be learned from the data and will be optimized by a grid search on the validation partition of the data.
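For concreteness, the following numpy sketch (ours; all names are hypothetical, and the 1/|X^train| factor is applied to the regression term following our reading of Equation 4) evaluates the latitude prediction of Equation 3 and the latitude loss of Equation 4; the longitude counterpart simply swaps in θ, T^lon and W^lon:

    import numpy as np

    def predict_lat(T_lat, phi, y_bar_lat):
        # Equation 3 for latitude: user bias plus a linear model on the
        # latent tweet features. T_lat: (m, k), phi: (k+1,) with phi[0]
        # the intercept, y_bar_lat: (m,) average training latitude per user.
        return y_bar_lat + phi[0] + T_lat @ phi[1:]

    def loss_lat(X, T_lat, W_lat, phi, y_lat, y_bar_lat,
                 lam_phi, lam_T, lam_W):
        # Equation 4: regression error, reconstruction error, and
        # Tikhonov / Frobenius regularization terms.
        err = predict_lat(T_lat, phi, y_bar_lat) - y_lat
        recon = X - T_lat @ W_lat
        return (np.sum(err ** 2) / X.shape[0]
                + lam_phi * np.sum(phi ** 2)
                + np.sum(recon ** 2)
                + lam_T * np.sum(T_lat ** 2)
                + lam_W * np.sum(W_lat ** 2))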
To solve the above optimization task, we apply Stochastic Gradient Descent (SGD) [2,13], where the learning rate is adapted using the Adaptive Subgradient Method (AdaGrad) [10], which helps yield better run-time performance. The basic idea of SGD is that, instead of expensively calculating the full gradient of Equation 4 and its longitude counterpart, it randomly selects a tweet and calculates the corresponding gradient. Suppose we have chosen a tweet indexed by m; the partial derivative of Equation 4 with respect to T^lat is then:

    ∂L^lat(ŷ^lat_m, y^lat_m) / ∂T^lat_{ml} = −( y^lat_m − ȳ^lat_{u_m} − Σ_{k=1}^{K} φ_k T^lat_{mk} − φ_0 ) φ_l
                                            − Σ_{n=1}^{N} ( X_{mn} − Σ_{k=1}^{K} T^lat_{mk} W^lat_{kn} ) W^lat_{ln} + λ_T T^lat_{ml}    (5)

The partial derivative with respect to the latent feature matrix W^lat of the tokens is:

    ∂L^lat(ŷ^lat_m, y^lat_m) / ∂W^lat_{lj} = −( X_{mj} − Σ_{k=1}^{K} T^lat_{mk} W^lat_{kj} ) T^lat_{ml} + λ_W W^lat_{lj}    (6)

Finally, the partial derivatives of the regression parameters have the form:

    ∂L^lat(ŷ^lat_m, y^lat_m) / ∂φ_j = −( y^lat_m − ȳ^lat_{u_m} − Σ_{k=1}^{K} φ_k T^lat_{mk} − φ_0 ) T^lat_{mj} + λ_φ φ_j
    ∂L^lat(ŷ^lat_m, y^lat_m) / ∂φ_0 = −( y^lat_m − ȳ^lat_{u_m} − Σ_{k=1}^{K} φ_k T^lat_{mk} − φ_0 )    (7)

The partial derivatives of the longitude loss with respect to T^lon, W^lon and θ are calculated in exactly the same manner as Equations 5, 6 and 7.

2.4 Inference for Test Data

By optimizing the respective loss terms on the training data, we learn the latent representation T of all training tweets as well as the linear regression parameters θ and φ for predicting the final geolocation. However, since we want to predict geolocations of unseen test tweets, the latent representations T of the individual training tweets cannot be employed. For this reason, we perform a fold-in, where we factorize the feature matrix X^test of the test data using the latent representation W of the word tokens that was learned on the training data. To avoid confusion, we denote the latent representations of the test tweets by T′^lat and T′^lon and factorize X^test as

    X^test ≈ T′^lat W^lat    (8)

with the analogous factorization for longitude. As we can see, W^lat and W^lon are reused from the learning phase. Subsequently, in the fold-in phase, we minimize the following objective with respect to T′^lat:

    L^lat(X^test, T′^lat W^lat) = (1 / |X^test|) ||X^test − T′^lat W^lat||_F^2 + λ_test ||T′^lat||_F^2    (9)

The partial derivative of Equation 9 with respect to T′^lat is:

    ∂L^lat / ∂T′^lat_{jk} = −( X^test_{jn} − Σ_{k′=1}^{K} T′^lat_{jk′} W^lat_{k′n} ) W^lat_{kn} + λ_test T′^lat_{jk}    (10)

The partial derivatives with respect to T′^lon are computed in the same manner as in Equation 10. Having learned the latent representation of the test tweets using the fold-in procedure, we can then perform predictions for the test users using Equation 3. However, not all users that appear in the test data necessarily appear in the training data, so for those users we cannot use their own average geolocation for the final prediction. Instead, we fall back to the average geolocation ȳ_U of all users of the training data:

    ȳ_{u_l} = ȳ_{u_l}   if u_l ∈ D^train
    ȳ_{u_l} = ȳ_U       otherwise    (11)

Algorithm 1 summarizes the overall GMF procedure.

Algorithm 1 GMF
Require: X^train ∈ R^{m×n}, X^test ∈ R^{l×n}, Y ∈ R^{m×2}
Ensure: T ∈ R^{m×k}, T′ ∈ R^{l×k}, W ∈ R^{k×n}, φ ∈ R^{k+1}, θ ∈ R^{k+1}
 1: Initialize T^lat, T^lon, W^lat, W^lon, φ, θ, T′^lat, T′^lon ← N(0, 1)
 2: // Learning phase
 3: for epoch ∈ 1, ..., max_epoch do
 4:   for iteration ∈ 1, ..., M do
 5:     Pick m randomly
 6:     Pick X^train_{mn} randomly
 7:     for k ∈ 1, ..., K do
 8:       Update T^lat_{mk}, T^lon_{mk}, W^lat_{kn}, W^lon_{kn}, φ_k, θ_k
 9:     end for
10:     Update φ_0, θ_0
11:   end for
12: end for
13: // Fold-in phase
14: for epoch ∈ 1, ..., max_epoch′ do
15:   for iteration ∈ 1, ..., L do
16:     Pick l and n randomly
17:     if X^test_{ln} exists then
18:       for k ∈ 1, ..., K do
19:         Update T′^lat_{lk}, T′^lon_{lk}
20:       end for
21:     end if
22:   end for
23: end for
24: // Prediction
25: for l ∈ 1, ..., L do
26:   ŷ^lat_l ← ȳ^lat_{u_l} + φ_0 + Σ_{k=1}^{K} φ_k T′^lat_{lk}
27:   ŷ^lon_l ← ȳ^lon_{u_l} + θ_0 + Σ_{k=1}^{K} θ_k T′^lon_{lk}
28: end for
29: return d_H(y, ŷ)
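As an illustration, the following Python sketch (ours) performs one stochastic update of the latitude parameters. It uses plain SGD with a constant learning rate lr rather than the AdaGrad scheme of [10], and it samples a single entry (m, n) of X as in Algorithm 1 instead of summing over all n as in Equation 5:

    import numpy as np

    def sgd_step_lat(X, y_lat, y_bar_lat, T, W, phi,
                     lam_T, lam_W, lam_phi, lr, rng):
        m = rng.integers(X.shape[0])  # pick a tweet at random
        n = rng.integers(X.shape[1])  # pick a token at random
        # Residual of the regression part (Equation 3).
        r = y_lat[m] - y_bar_lat[m] - phi[0] - T[m] @ phi[1:]
        # Residual of the reconstruction part for the entry (m, n).
        e = X[m, n] - T[m] @ W[:, n]
        # Stochastic versions of the gradients in Equations 5, 6 and 7.
        grad_T = -r * phi[1:] - e * W[:, n] + lam_T * T[m]
        grad_W = -e * T[m] + lam_W * W[:, n]
        grad_phi = -r * T[m] + lam_phi * phi[1:]
        grad_phi0 = -r
        # Apply the updates.
        T[m] -= lr * grad_T
        W[:, n] -= lr * grad_W
        phi[1:] -= lr * grad_phi
        phi[0] -= lr * grad_phi0

The fold-in update for a test tweet follows the same pattern, with only the reconstruction residual of Equation 10 and W held fixed.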
3 Experiments

In this section, we first describe the datasets that we use as well as their preprocessing. Additionally, we describe how we optimized the hyperparameters of our model. Finally, we compare our approach to a set of competing methods.

3.1 Dataset

We have worked with three publicly available tweet datasets containing geolocation information and compiled them to fit the user geolocation prediction task in the near real-time scenario. One dataset comprises tweets posted within the United States, while the other two contain tweets localized to North America and to the whole world, respectively. Through this, we evaluate our model's effectiveness and generality at different geographical scopes, from a single country to the whole world.

A splitting protocol is then designed for these datasets. First, we randomly split all tweets of each user by a 60/20/20 scheme, denoted as LocalRandom (LR). Secondly, to investigate how our model copes with users that appear in the test set but not in the training data, we also split all tweets globally by the same 60/20/20 scheme, called GlobalRandom (GR). A sketch of both protocols is given after the dataset descriptions below.

US. This dataset was originally compiled by [12] and was later also used in [11,30,16]. It comprises tweets gathered from the "Gardenhose" sample stream in the first week of March 2010. The authors already provide geotagged tweets that we simply reuse. The resulting dataset contains 377,616 tweets posted by 9,475 users.

NA. The second dataset was collected by [25] and later used in [29,15]. It contains tweets from North America, including the United States and parts of Canada and Mexico, posted from September 4th to November 29th, 2011. Because Twitter did not allow the distribution of complete tweets at that time, the NA dataset only contains user IDs and tweet IDs. We therefore fetched the tweets via the official Twitter API to check whether they are still available and carry embedded coordinates. Only 226,595 tweets out of 38 million, posted by 10,950 users, have geotags available and are therefore considered for the final dataset.

WORLD. The last dataset was compiled by [14] and later used in [29,15]. It comprises tweets from all over the world. We apply the same retrieval procedure as for the NA dataset. The resulting dataset contains 121,327 tweets posted by 80,179 users. In the WORLD dataset, 70% of the users have only one tweet, so we only apply the GR 60/20/20 splitting scheme to it.
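The two splitting protocols can be sketched as follows (our illustration; names are hypothetical, and both functions operate on tweet indices given the list of posting users):

    import numpy as np

    def local_random_split(users, rng, ratios=(0.6, 0.2, 0.2)):
        # LR: split each user's tweets 60/20/20 into train/valid/test.
        by_user = {}
        for i, u in enumerate(users):
            by_user.setdefault(u, []).append(i)
        train, valid, test = [], [], []
        for ids in by_user.values():
            ids = rng.permutation(ids)
            n_tr = int(ratios[0] * len(ids))
            n_va = int(ratios[1] * len(ids))
            train.extend(ids[:n_tr])
            valid.extend(ids[n_tr:n_tr + n_va])
            test.extend(ids[n_tr + n_va:])
        return train, valid, test

    def global_random_split(n_tweets, rng, ratios=(0.6, 0.2, 0.2)):
        # GR: split all tweets 60/20/20 regardless of user, so users
        # in the test set may be unseen during training.
        ids = rng.permutation(n_tweets)
        n_tr = int(ratios[0] * n_tweets)
        n_va = int(ratios[1] * n_tweets)
        return ids[:n_tr], ids[n_tr:n_tr + n_va], ids[n_tr + n_va:]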
3.2 Data Preprocessing

In addition to the length restriction, tweets are characterized by the use of terms that are not found in natural language, including hashtags, abbreviations, emoticons and URLs. We therefore apply the following data preprocessing procedure.

Tokenization. We apply a unigram tokenization procedure that preserves hashtags, @-replies, abbreviations, blocks of punctuation, emoticons, unicode glyphs and other symbols as tokens. We remove URL tokens to keep out tweets in which bots post information such as advertisements.

Bag-of-words representation. After all tweets are tokenized, they are converted from sparse vectors of token counts into sparse vectors of bag-of-words representations using term frequency–inverse document frequency (TF-IDF) scores. By using TF-IDF scores, we discard language and grammar structure, token order, semantics and part-of-speech information. The TF-IDF weight reflects how important a token is to an instance: the more common a token is across many instances, the more it is penalized, so the tokens with the highest TF-IDF weights are often those that best characterize an instance.

3.3 Evaluation Metrics

Given the ellipsoidal shape of the earth's surface, we apply the Haversine distance to calculate the distance between two points represented by their latitude in the range [−90, 90] and longitude in the range [−180, 180]. The Haversine distance d_H : R^2 × R^2 → R is the great-circle distance between two geographical coordinate pairs, computed by the Haversine formula [24]. The central angle α between the two points is given by:

    α = [ sin^2( |ŷ^lat − y^lat| / 2 ) + cos(y^lat) cos(ŷ^lat) sin^2( |ŷ^lon − y^lon| / 2 ) ]^{1/2}    (12)

The Haversine distance of the two points is then:

    d_H(y, ŷ) = 2 r arcsin(α)    (13)

where r is the radius of the earth. Because of the ellipsoidal shape of the earth, its radius varies from the equator to the poles; following [7], we take the mean radius of the earth, r = 6371 km. The evaluation metrics are the mean and median Haversine distance d_H in kilometers between the ground-truth geolocations y and the predicted geolocations ŷ.
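A direct Python implementation of this metric might look as follows (our sketch; note that latitudes and longitudes must be converted to radians, a step that Equations 12 and 13 leave implicit). The reported mean and median errors are then simply the mean and median of these per-tweet distances:

    import numpy as np

    EARTH_RADIUS_KM = 6371.0  # mean earth radius, following [7]

    def haversine_km(y, y_hat):
        # Haversine distance (Equations 12 and 13) between two
        # (latitude, longitude) pairs given in degrees.
        lat1, lon1, lat2, lon2 = np.radians([y[0], y[1], y_hat[0], y_hat[1]])
        alpha = np.sqrt(np.sin((lat2 - lat1) / 2.0) ** 2
                        + np.cos(lat1) * np.cos(lat2)
                        * np.sin((lon2 - lon1) / 2.0) ** 2)
        return 2.0 * EARTH_RADIUS_KM * np.arcsin(alpha)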
3.4 Hyperparameter Setup

In order to obtain good predictive performance, we need to carefully tune the hyperparameters of our model. By k ∈ N+ we denote the number of latent features used within the factorization of X. By λ_T, λ_W, λ_φ, λ_θ and λ_T′ we denote the regularization hyperparameters used when learning the latent tweet matrices, the latent vocabulary matrices, the linear regression parameters for predicting latitude and longitude, and the latent feature matrices of the test tweets, respectively. By α_T, α_W, α_φ, α_θ and α_T′ we denote the respective learning rates. We tune the hyperparameters by assessing the validation performance of our model and choosing the configuration that performs best. The number of latent dimensions is selected from k ∈ {2, 4, 8, 16}, while the values of all other hyperparameters are selected from {0.1, 0.01, 0.001, 0.0001, 0.00001}. The preprocessed datasets used in this paper are publicly available at http://fs.ismll.de/publicspace/GMF/.

3.5 Results and Comparison

For the Support Vector Machine (SVM) and Factorization Machines (FM) baselines, we run separate models to predict latitude and longitude. To allow for a fair comparison, these regression models also include the user bias in their estimation. Finally, we combine the predicted latitude and longitude to compute the final distance. For these models, we also apply a grid-search mechanism to find the best hyperparameter configuration for each prediction of latitude and longitude. On each dataset, we run the models 10 times and report the average results. The final results can be observed in Table 1.

We can see that the other regression models on average do not perform well, mainly because they use the extremely sparse 5,200 TF-IDF features. Our model, however, maps each tweet individually into an eight-dimensional latent feature space and uses those features for prediction; the number of latent features k is found by the grid-search mechanism. The results show that GMF outperforms all competitors by large margins.

Table 1. Results of the regression approaches targeting the user's geolocation at a given tweet's posting time, using only textual information. The mean and median Haversine distance errors are in km. The best distances are in bold.

Corpus                LR US           LR NA            GR US           GR NA            WORLD
Model                 mean    median  mean     median  mean    median  mean     median  mean      median
SVM (RBF kernel) [4]  34.63   7.81    157.81   8.42    32.29   8.22    171.72   10.23   3179.57   2654.17
FM [23]               29.67   0.68    164.51   7.27    27.09   0.66    177.53   7.26    3219.16   2650.48
Our model             29.15   0.66    157.22   6.95    26.44   0.65    170.08   7.19    2524.66   553.24

We also report the state-of-the-art results of classification approaches (see Table 2). One might notice significant differences in prediction accuracy between the two geolocation prediction scenarios. By targeting the user's geolocation at a given tweet's posting time, our model significantly reduces the localization error on the US and NA datasets. For the WORLD dataset, where the average individual tweet length is 5 tokens compared to 49 tokens for a concatenation of tweets, our model still achieves reasonable results.

Table 2. State-of-the-art results of classification approaches targeting the user's geolocation in the all-at-once scenario using only textual information. The mean and median Haversine distance errors are in km ("-" signifies no reported results for the given dataset, and "?" signifies that no result was reported for the given metric).

Corpus                        US            NA             WORLD
Model                         mean  median  mean   median  mean    median
Hierarchical clustering [29]  -     -       686.6  171.5   1669.6  490.0
Hierarchical topic model [1]  ?     298     -      -       -       -

4 Conclusion and Future Work

We have investigated the geo matrix factorization model for the task of near real-time text-based geolocation in Twitter. In our work, we tackle the user geolocation prediction task from a regression perspective and analyze a single tweet as the model's input, without any concatenation. Through this, we can further predict the user trajectory and achieve geolocation at a given tweet's posting time. This is a starting point for further investigation of the effect of tweet concatenation, or of the number of tweets needed to achieve an acceptable distance error. Furthermore, we address the sparsity and imbalance of online conversational texts with a matrix factorization technique. Based on the experimental results, our model outperforms all competitors, including SVM and FM, within the regression task using dedicated latent feature spaces. In comparison with the current state-of-the-art results of classification approaches, our model still outperforms them or achieves reasonable results.

Future work broadly falls into two directions: optimization and applying the model to different datasets. In the optimization direction, we will analyze direct optimization of the Haversine formula. We also plan to extend our model to predict the near real-time geolocation of other types of data such as Wikipedia articles and Flickr images.

Acknowledgments.
Nghia Duong-Trung gratefully acknowledges the funding of his work by the Ministry of Education and Training of Vietnam under the national project no. 911.

References

1. Ahmed, A., Hong, L., Smola, A.J.: Hierarchical geographical modeling of user locations from social media posts. In: Proceedings of the 22nd International Conference on World Wide Web. pp. 25–36. International World Wide Web Conferences Steering Committee (2013)
2. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010, pp. 177–186. Springer (2010)
3. Burton, S.H., Tanner, K.W., Giraud-Carrier, C.G., West, J.H., Barnes, M.D.: "Right time, right place" health communication on Twitter: value and accuracy of location information. Journal of Medical Internet Research 14(6) (2012)
4. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
5. Cheng, Z., Caverlee, J., Lee, K.: You are where you tweet: a content-based approach to geo-locating Twitter users. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 759–768. ACM (2010)
6. Compton, R., Jurgens, D., Allen, D.: Geotagging one hundred million Twitter accounts with total variation minimization. In: Big Data (Big Data), 2014 IEEE International Conference on. pp. 393–401. IEEE (2014)
7. Decker, B.L.: World geodetic system 1984. Tech. rep., DTIC Document (1986)
8. Dias, D., Anastácio, I., Martins, B.: A language modeling approach for georeferencing textual documents. In: Actas del Congreso Español de Recuperación de Información (2012)
9. Dredze, M.: How social media will change public health. IEEE Intelligent Systems 27(4), 81–84 (2012)
10. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12, 2121–2159 (2011)
11. Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 1041–1048 (2011)
12. Eisenstein, J., O'Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. pp. 1277–1287. Association for Computational Linguistics (2010)
13. Gemulla, R., Nijkamp, E., Haas, P.J., Sismanis, Y.: Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 69–77. ACM (2011)
14. Han, B., Cook, P., Baldwin, T.: Geolocation prediction in social media data by finding location indicative words. Proceedings of COLING 2012: Technical Papers pp. 1045–1062 (2012)
15. Han, B., Cook, P., Baldwin, T.: Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research pp. 451–500 (2014)
16. Hong, L., Ahmed, A., Gurumurthy, S., Smola, A.J., Tsioutsiouliklis, K.: Discovering geographical topics in the Twitter stream. In: Proceedings of the 21st International Conference on World Wide Web. pp. 769–778. ACM (2012)
17. Jurgens, D.: That's what friends are for: Inferring location in online social media platforms based on social relationships. In: ICWSM (2013)
18. Kinsella, S., Murdock, V., O'Hare, N.: I'm eating a sandwich in Glasgow: modeling locations with tweets.
In: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents. pp. 61–68. ACM (2011)
19. Li, R., Wang, S., Deng, H., Wang, R., Chang, K.C.C.: Towards social user profiling: unified and discriminative influence model for inferring home locations. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1023–1031. ACM (2012)
20. Mahmud, J., Nichols, J., Drews, C.: Home location identification of Twitter users. ACM Transactions on Intelligent Systems and Technology (TIST) 5(3), 47 (2014)
21. McClendon, S., Robinson, A.C.: Leveraging geospatially-oriented social media communications in disaster response. International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 5(1), 22–40 (2013)
22. McGee, J., Caverlee, J., Cheng, Z.: Location prediction in social media based on tie strength. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. pp. 459–468. ACM (2013)
23. Rendle, S.: Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST) 3(3), 57 (2012)
24. Robusto, C.: The cosine-haversine formula. American Mathematical Monthly pp. 38–40 (1957)
25. Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1500–1510. Association for Computational Linguistics (2012)
26. Rout, D., Bontcheva, K., Preoţiuc-Pietro, D., Cohn, T.: Where's @wally?: a classification approach to geolocating users based on their social ties. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media. pp. 11–20. ACM (2013)
27. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th International Conference on World Wide Web. pp. 851–860. ACM (2010)
28. Weng, J., Lee, B.S.: Event detection in Twitter. ICWSM 11, 401–408 (2011)
29. Wing, B., Baldridge, J.: Hierarchical discriminative classification for text-based geolocation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pp. 336–348 (2014)
30. Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic grids. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. pp. 955–964. Association for Computational Linguistics (2011)
31. Yin, J., Lampert, A., Cameron, M., Robinson, B., Power, R.: Using social media to enhance emergency situation awareness. IEEE Intelligent Systems (6), 52–59 (2012)