=Paper= {{Paper |id=Vol-1392/paper-15 |storemode=property |title=Modelling Time and Location in Topic Models |pdfUrl=https://ceur-ws.org/Vol-1392/paper-15.pdf |volume=Vol-1392 |dblpUrl=https://dblp.org/rec/conf/icml/Politz15 }} ==Modelling Time and Location in Topic Models== https://ceur-ws.org/Vol-1392/paper-15.pdf
                           Modelling Time and Location in Topic Models


Christian Pölitz                                                                 christian.poelitz@tu-dortmund.de
TU Dortmund University, Artificial Intelligence Group,
Otto Hahn Str. 12, 44227 Dortmund, Germany



Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

Abstract

Many text collections, such as newspaper or social media blog archives, contain texts that refer to specific dates and/or locations. This information can be valuable for investigating topics in certain regions and time spans. We use topic models that integrate time and geographical information extracted from the texts to find such topics. In this extended abstract, we motivate our approach and briefly describe the method. Experimental evaluations and a detailed description are in preparation for a full paper.

1. Introduction

Topic models (see for instance (Blei et al., 2003)) have been used extensively to summarize text collections into semantic clusters. Such text collections can contain, for instance, newspaper articles, blog entries, tweets or any written social media content. The documents in these collections often contain information about locations and dates. This information is valuable for extracting topics for certain regions or time spans. In order to integrate temporal and positional information, we need the corresponding time and location information for our text corpus. Previous approaches assumed that we either directly have information about time and position for each document in the corpus or that a named entity recognition tool finds geographic locations. We propose a hybrid approach that extends standard Latent Dirichlet Allocation (LDA (Blei et al., 2003)) topic models. We assume that the documents can have multiple dates and positions. In order to integrate multiple pieces of information for single documents, we propose to extract parts (or chunks) of the documents that contain information about only one location and one date. For example, a document might contain the sentence: “The weather was nice in Berlin last Sunday, but next day at home in Cologne it was cloudy.” In order to reflect all temporal and positional information, NLP tools would divide the sentence into “The weather was nice in Berlin last Sunday” and “but next day at home in Cologne it was cloudy” and label the words in the first chunk with the time stamp of that Sunday and the geographical location of Berlin. The words in the second chunk are labeled with the time stamp of the following Monday and the geographical location of Cologne.

2. Related Work

There are several previous approaches that integrate temporal and positional information into topic models. Yin et al. (Yin et al., 2011) discuss methods to find and compare topics in documents that have associated GPS coordinates. Speriosu et al. (Speriosu et al., 2010) propose topic models that use non-overlapping regions as latent topics. In this way, they model each document as a distribution over these regions. Further approaches use topic models with geographic information on social media data to extract activity patterns of users. Hasan and Ukkusuri (Hasan & Ukkusuri, 2014), for instance, use topic models that integrate sequences of activities rather than documents. Hong et al. (Hong et al., 2012) introduce a sparse generative topic model, and Kurashima et al. (Kurashima et al., 2013) propose a geographic topic model that uses Twitter tweets to extract user activities in terms of movement and interests.

3. Method

While standard topic models only group words and documents into semantically related topics, we are further interested in the distribution of the topics over time and geographic position. In order to extract the distribution of word senses over time and positions, we use topic models that consider temporal information about the documents as well as locations in the form of numerical vectors that represent the geographic position. This means that each document has at least one time stamp and one geographic position. The time stamps are assumed to be Beta distributed and the positions Normally distributed. The two distributions are simply integrated in an LDA topic model under the assumption that, given the latent topics, the words, the time stamps and the geographic positions are independent. We combine the method by Wang and McCallum (Wang & McCallum, 2006), introduced as topics over time, with supervised LDA, introduced by Blei and McAuliffe (Blei & McAuliffe, 2007).

The generative process of the words, dates and locations is:

 1. For each topic $t$:
      (a) Draw $\theta_t \sim \mathrm{Dir}(\beta)$
 2. For each document $d$:
      (a) Draw $\phi_d \sim \mathrm{Dir}(\alpha)$
      (b) For each chunk $c(d)$:
           i. For each word $i$ in $c(d)$:
             A. Draw $z_i \sim \mathrm{Mult}(\phi_d)$
             B. Draw $w_i \sim \mathrm{Mult}(\theta_{z_i})$
             C. Draw $t_c \sim \mathrm{Beta}(\psi_{z_i})$
             D. Draw $l_c \sim \mathcal{N}(\eta' \hat{z}_{c(d)}, \rho^2)$

Assuming a fixed number of topics, we draw for each of them a Multinomial distribution over the words in this topic from a Dirichlet distribution $\mathrm{Dir}(\beta)$ with metaparameter $\beta$. For each document we draw a Multinomial distribution over the topics in this document from a Dirichlet distribution $\mathrm{Dir}(\alpha)$ with metaparameter $\alpha$. For each word in the document we draw a topic with respect to the topic distribution in the document and a word based on the word distribution for the drawn topic. Additionally, we draw a time stamp $t_c \sim \mathrm{Beta}(\psi_{z_i})$, with $\psi_{z_i} = (a, b)$ the shape parameters of the Beta distribution, and the location $l_c \sim \mathcal{N}(\eta' \hat{z}_{c(d)}, \rho^2)$, with $\hat{z}_{c(d)}$ the empirical topic frequencies for document $d$.

The shape parameters $\psi$ are estimated by the method of moments. For each topic $z$ we estimate the mean $\hat{m}$ and the sample variance $s^2$ of all time stamps from the documents that have been assigned this topic. For each topic we set

\[
a = \hat{m} \cdot \left( \frac{\hat{m} \cdot (1 - \hat{m})}{s^2} - 1 \right)
\quad \text{and} \quad
b = (1 - \hat{m}) \cdot \left( \frac{\hat{m} \cdot (1 - \hat{m})}{s^2} - 1 \right).
\]

Finally, for the Normal distribution, $\eta$ is estimated via EM methods that maximize the likelihood during the estimation of the topic model:

\[
L(\eta) = -\frac{1}{2\rho} \sum_d (y_d - \eta' \hat{z}_d)^2 - \frac{1}{2\sigma} \sum_k \eta_k^2
\]

Integrating the time stamp as a Beta distributed random variable and the geographic location as a Normally distributed random variable, we get for the probability of a topic $z_i$, given a word $w$ in a chunk $c(d)$ with time stamp $t$ and location $l$ and all other topic assignments:

\[
p(z_i \mid w, t, l, z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_T)
\propto \frac{N_{w,z_i} - 1 + \beta}{N_{z_i} - 1 + W \cdot \beta} \cdot (N_{d,z_i} + \alpha) \cdot \frac{(1 - t_{c(d)})^{a-1} \cdot t_{c(d)}^{b-1}}{\mathrm{Beta}(a, b)} \cdot \exp\left( -\frac{\lVert l_{c(d)} - \mu_{w,d} \rVert^2}{2\rho} \right)
\tag{1}
\]

Using this, we can estimate the topic model via Gibbs sampling.

References

Blei, David M. and McAuliffe, Jon D. Supervised topic models. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pp. 121–128, 2007.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003. ISSN 1532-4435.

Hasan, Samiul and Ukkusuri, Satish V. Urban activity pattern classification using topic models from online geo-location data. Transportation Research Part C: Emerging Technologies, 44(0):363–381, 2014. ISSN 0968-090X. doi: http://dx.doi.org/10.1016/j.trc.2014.04.003.

Hong, Liangjie, Ahmed, Amr, Gurumurthy, Siva, Smola, Alexander J., and Tsioutsiouliklis, Kostas. Discovering geographical topics in the twitter stream. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pp. 769–778, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1229-5. doi: 10.1145/2187836.2187940.

Kurashima, Takeshi, Iwata, Tomoharu, Hoshide, Takahide, Takaya, Noriko, and Fujimura, Ko. Geo topic model: Joint modeling of user’s activity area and interests for location recommendation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pp. 375–384, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1869-3. doi: 10.1145/2433396.2433444.

Speriosu, M., Brown, T., Moon, T., Baldridge, J., and Erk, K. Connecting Language and Geography with Region-Topic Models. 2010.

Wang, Xuerui and McCallum, Andrew. Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pp. 424–433, New York, NY, USA, 2006. ACM. ISBN 1-59593-339-5. doi: 10.1145/1150402.1150450.

Yin, Zhijun, Cao, Liangliang, Han, Jiawei, Zhai, Chengxiang, and Huang, Thomas. Geographical topic discovery and comparison. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 247–256, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963443.
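As an illustration of the generative process in Section 3, the following is a minimal Python sketch and not the author's implementation. All function and parameter names are hypothetical. It assumes that each word is drawn from the word distribution of its drawn topic, that time stamps are scaled to the unit interval, that `random.betavariate` is used in its standard Beta(a, b) parameterisation, and that a location is a short coordinate vector whose coordinates receive independent Gaussian noise with standard deviation rho.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(rng, chunk_lengths, theta, alpha, psi, eta, rho):
    """Generate one document as a list of chunks (hypothetical helper).

    theta[k] : word distribution of topic k, drawn beforehand from Dir(beta)
    psi[k]   : Beta shape parameters (a, b) of topic k
    eta[k]   : assumed per-topic location vector, e.g. (latitude, longitude)
    rho      : standard deviation of the location noise
    """
    num_topics = len(theta)
    phi_d = sample_dirichlet(rng, alpha, num_topics)  # topic mixture of the document
    chunks = []
    for length in chunk_lengths:
        topics, words, times = [], [], []
        for _ in range(length):
            z = rng.choices(range(num_topics), weights=phi_d)[0]        # z_i ~ Mult(phi_d)
            w = rng.choices(range(len(theta[z])), weights=theta[z])[0]  # w_i from topic z
            a, b = psi[z]
            t = rng.betavariate(a, b)                                   # time stamp ~ Beta(psi_z)
            topics.append(z)
            words.append(w)
            times.append(t)
        # Empirical topic frequencies of the chunk.
        z_hat = [topics.count(k) / length for k in range(num_topics)]
        # Location mean eta' z_hat, then one Gaussian draw per coordinate.
        dims = len(eta[0])
        mean = [sum(z_hat[k] * eta[k][j] for k in range(num_topics)) for j in range(dims)]
        location = [rng.gauss(m, rho) for m in mean]
        chunks.append({"topics": topics, "words": words, "times": times, "location": location})
    return chunks


# Example with two topics, a vocabulary of six words and two chunks.
rng = random.Random(0)
theta = [sample_dirichlet(rng, 0.5, 6) for _ in range(2)]
document = generate_document(
    rng, [4, 3], theta, alpha=0.5,
    psi=[(2.0, 5.0), (5.0, 2.0)],
    eta=[(52.5, 13.4), (50.9, 7.0)],
    rho=0.1,
)
```

The sketch follows the list literally and draws one time stamp per word. In a corpus, all words of a chunk share the single observed time stamp of that chunk.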




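The two estimation steps of Section 3 can also be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the author's code. The function names are hypothetical, time stamps are assumed to be scaled to the open interval (0, 1), and `mu` stands for the location mean written as μ in equation (1), which the text does not define further and which is therefore passed in by the caller. The time term reproduces the form printed in equation (1).

```python
import math
from statistics import fmean, variance


def beta_shapes_from_moments(timestamps):
    """Method-of-moments estimate of (a, b) from time stamps scaled to (0, 1)."""
    m = fmean(timestamps)
    s2 = variance(timestamps)  # sample variance, n - 1 in the denominator
    if s2 <= 0.0:
        raise ValueError("time stamps must not all be identical")
    common = m * (1.0 - m) / s2 - 1.0
    if common <= 0.0:
        raise ValueError("moments are not compatible with a Beta distribution")
    return m * common, (1.0 - m) * common


def topic_weight(n_wz, n_z, n_dz, vocab_size, alpha, beta, t, a, b, loc, mu, rho):
    """Unnormalised weight of one topic, term by term as in equation (1)."""
    word_term = (n_wz - 1 + beta) / (n_z - 1 + vocab_size * beta)
    doc_term = n_dz + alpha
    # Beta function via log-Gamma for numerical stability.
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    time_term = (1.0 - t) ** (a - 1) * t ** (b - 1) / math.exp(log_beta_fn)
    sq_dist = sum((x - y) ** 2 for x, y in zip(loc, mu))
    loc_term = math.exp(-sq_dist / (2.0 * rho))
    return word_term * doc_term * time_term * loc_term
```

In a Gibbs sweep, one would evaluate `topic_weight` for every topic, normalise the weights to sum to one and sample the new assignment from the result. The shape parameters would be refreshed with `beta_shapes_from_moments` from the time stamps currently assigned to each topic.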