Modelling Time and Location in Topic Models

Christian Pölitz (christian.poelitz@tu-dortmund.de)
TU Dortmund University, Artificial Intelligence Group, Otto Hahn Str. 12, 44227 Dortmund, Germany

Abstract

Many text collections, such as newspaper or social media blog archives, contain texts that often refer to specific dates and/or locations. This information can be valuable for investigating topics in certain regions and time spans. We use topic models that integrate time and geographical information extracted from the texts to find such topics. In this extended abstract, we motivate our approach and briefly describe the method. Experimental evaluations and a detailed description are in preparation for a full paper.

1. Introduction

Topic models (see for instance (Blei et al., 2003)) have been used extensively to summarize text collections into semantic clusters. Such text collections can contain, for instance, newspaper articles, blog entries, tweets or any written social media content. The documents in these collections often contain information about locations and dates. This information is valuable for extracting topics for certain regions or time spans. In order to integrate temporal and positional information, we need the corresponding time and location information for our text corpus. Previous approaches assumed that we either directly have information about time and position for each document in the corpus or that a named entity recognition tool finds geographic locations. We propose a hybrid approach that extends standard Latent Dirichlet Allocation (LDA (Blei et al., 2003)) topic models. We assume that the documents can have multiple dates and positions. In order to integrate multiple pieces of information for single documents, we propose to extract parts (or chunks) of the documents that contain information about only one location and one date. For example, a document might contain the sentence: "The weather was nice in Berlin last Sunday, but next day at home in Cologne it was cloudy." In order to reflect all temporal and positional information, NLP tools would divide the sentence into "The weather was nice in Berlin last Sunday" and "but next day at home in Cologne it was cloudy" and label the words in the first chunk with the time stamp of that corresponding Sunday and the geographical location of Berlin. The words in the second chunk are labeled with the time stamp of that corresponding Monday and the geographical location of Cologne.

2. Related Work

There are several previous approaches that integrate temporal and positional information into topic models. Yin et al. (Yin et al., 2011) discuss methods to find and compare topics in documents that have associated GPS coordinates. Speriosu et al. (Speriosu et al., 2010) propose topic models that use non-overlapping regions as latent topics. By this, they model each document as a distribution over these regions. Further approaches use topic models with geographic information on social media data to extract activity patterns of users. Hasan and Ukkusuri (Hasan & Ukkusuri, 2014), for instance, use topic models that integrate sequences of activities rather than documents. Hong et al. (Hong et al., 2012) introduce a sparse generative topic model, and Kurashima et al. (Kurashima et al., 2013) propose a geographic topic model that uses Twitter tweets to extract user activities in terms of movement and interests.

3. Method

While standard topic models only group words and documents into semantically related topics, we are further interested in the distribution of the topics over time and geographic position. In order to extract the distribution of word senses over time and positions, we use topic models that consider temporal information about the documents as well as locations in the form of numerical vectors that represent the geographic position.
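
To make these inputs concrete, the following minimal sketch shows one possible in-memory representation of the two chunks from the Berlin/Cologne example in the Introduction. The `Chunk` class, the `normalize_day` helper, the concrete dates, the corpus time span and the approximate city coordinates are assumptions made for illustration and are not part of the described method. The rescaling of dates into the unit interval is likewise an assumption; it is included because a Beta distribution is only defined on values between 0 and 1.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Chunk:
    """A part of a document that refers to exactly one date and one location."""
    tokens: List[str]
    day: date                        # date the chunk refers to
    location: Tuple[float, float]    # (latitude, longitude) in degrees


def normalize_day(day: date, start: date, end: date) -> float:
    """Map a date into the open interval (0, 1) relative to a corpus time span."""
    span = (end - start).days
    # Shift by one day and widen the span by two days so that the first and
    # last day of the corpus never map to exactly 0 or 1.
    return ((day - start).days + 1) / (span + 2)


# Hypothetical dates: we pretend that "last Sunday" was 2015-03-01.
chunks = [
    Chunk("the weather was nice in berlin last sunday".split(),
          date(2015, 3, 1), (52.52, 13.405)),      # approx. Berlin
    Chunk("but next day at home in cologne it was cloudy".split(),
          date(2015, 3, 2), (50.9375, 6.9603)),    # approx. Cologne
]

# Hypothetical corpus time span used for the rescaling.
corpus_start, corpus_end = date(2015, 1, 1), date(2015, 12, 31)
times = [normalize_day(c.day, corpus_start, corpus_end) for c in chunks]
```

Each chunk then carries the three observations that the model uses: its words, one normalized time stamp and one location vector.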
Each document therefore has at least one time stamp and one geographic position. The time stamps are assumed to be Beta distributed and the positions Normally distributed. The two distributions are simply integrated in an LDA topic model under the assumption that, given the latent topics, the words, the time stamps and the geographic positions are independent. We combine topics over time, introduced by Wang and McCallum (Wang & McCallum, 2006), with supervised LDA, introduced by Blei and McAuliffe (Blei & McAuliffe, 2007).

The generative process of the words, dates and locations is:

1. For each topic $t$:
   (a) Draw $\theta_t \sim \mathrm{Dir}(\beta)$
2. For each document $d$:
   (a) Draw $\phi_d \sim \mathrm{Dir}(\alpha)$
   (b) For each chunk $c(d)$:
      i. For each word $i$ in $c(d)$:
         A. Draw $z_i \sim \mathrm{Mult}(\phi_d)$
         B. Draw $w_i \sim \mathrm{Mult}(\theta_{z_i})$
         C. Draw $t_c \sim \mathrm{Beta}(\psi_{z_i})$
         D. Draw $l_c \sim \mathcal{N}(\eta' \hat{z}_{c(d)}, \rho^2)$

Assuming a number of topics, we draw for each of them a Multinomial distribution over the words in this topic from a Dirichlet distribution $\mathrm{Dir}(\beta)$ with metaparameter $\beta$. For each document we draw a Multinomial distribution of the topics in this document from a Dirichlet distribution $\mathrm{Dir}(\alpha)$ with metaparameter $\alpha$. For each word in the document we draw a topic with respect to the topic distribution in the document and a word based on the word distribution for the drawn topic. Additionally, we draw a time stamp $t_i \sim \mathrm{Beta}(\psi_{z_i})$, with $\psi_{z_i} = (a, b)$ the shape parameters of the Beta distribution, and the location $l_i \sim \mathcal{N}(\eta' \hat{z}_{c(d)}, \rho^2)$, with $\hat{z}_{c(d)}$ the empirical topic frequencies for document $d$.

The shape parameters $\psi$ are estimated by the method of moments. For each topic $z$ we estimate the mean $\hat{m}$ and sample variance $s^2$ of all time stamps from the documents that have been assigned this topic. For each topic we set

$$a = \hat{m} \cdot \left( \frac{\hat{m} \cdot (1 - \hat{m})}{s^2} - 1 \right), \qquad b = (1 - \hat{m}) \cdot \left( \frac{\hat{m} \cdot (1 - \hat{m})}{s^2} - 1 \right).$$

Finally, for the Normal distribution, $\eta$ is estimated via EM methods that maximize the likelihood during the estimation of the topic model:

$$L(\eta) = -\frac{1}{2\rho} \sum_d (y_d - \eta' \hat{z}_d)^2 - \frac{1}{2\sigma} \sum_k \eta_k^2$$

Integrating the time stamp as a Beta distributed random variable and the geographic location as a Normally distributed random variable, we get for the probability of a topic $z_i$, given a word $w$ in a chunk $c(d)$ with time stamp $t$ and location $l$ and all other topic assignments:

$$p(z_i \mid w, t, l, z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_T) \propto \frac{N_{w,z_i} - 1 + \beta}{N_{z_i} - 1 + W \cdot \beta} \cdot (N_{d,z_i} + \alpha) \cdot \frac{(1 - t_{c(d)})^{a-1} \cdot t_{c(d)}^{b-1}}{\mathrm{Beta}(a, b)} \cdot \exp\left( -\frac{\| l_{c(d)} - \mu_{w,d} \|^2}{2\rho} \right) \quad (1)$$

Using this, we can estimate the topic model via Gibbs sampling.

References

Blei, David M. and McAuliffe, Jon D. Supervised topic models. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pp. 121–128, 2007.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003. ISSN 1532-4435.

Hasan, Samiul and Ukkusuri, Satish V. Urban activity pattern classification using topic models from online geo-location data. Transportation Research Part C: Emerging Technologies, 44(0):363–381, 2014. ISSN 0968-090X. doi: http://dx.doi.org/10.1016/j.trc.2014.04.003.

Hong, Liangjie, Ahmed, Amr, Gurumurthy, Siva, Smola, Alexander J., and Tsioutsiouliklis, Kostas. Discovering geographical topics in the twitter stream. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pp. 769–778, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1229-5. doi: 10.1145/2187836.2187940.

Kurashima, Takeshi, Iwata, Tomoharu, Hoshide, Takahide, Takaya, Noriko, and Fujimura, Ko. Geo topic model: Joint modeling of user's activity area and interests for location recommendation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pp. 375–384, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1869-3. doi: 10.1145/2433396.2433444.

Speriosu, M., Brown, T., Moon, T., Baldridge, J., and Erk, K. Connecting Language and Geography with Region-Topic Models. 2010.

Wang, Xuerui and McCallum, Andrew. Topics over time: A non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pp. 424–433, New York, NY, USA, 2006. ACM. ISBN 1-59593-339-5. doi: 10.1145/1150402.1150450.

Yin, Zhijun, Cao, Liangliang, Han, Jiawei, Zhai, Chengxiang, and Huang, Thomas. Geographical topic discovery and comparison. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pp. 247–256, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963443.

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.
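
As an illustration of the method-of-moments step described in the Method section, the following minimal sketch computes the shape parameters (a, b) of one topic from the time stamps currently assigned to it. The function name `beta_shape_from_moments`, the example values and the validity check are assumptions added for illustration; the sketch also assumes that the time stamps have already been rescaled into the interval (0, 1).

```python
from statistics import mean, variance


def beta_shape_from_moments(timestamps):
    """Method-of-moments estimate of the Beta shape parameters (a, b).

    `timestamps` are assumed to lie in (0, 1); at least two values are
    needed, otherwise the sample variance is undefined.
    """
    m = mean(timestamps)
    s2 = variance(timestamps)  # sample variance (divides by n - 1)
    # The estimate only yields positive shape parameters when
    # 0 < s2 < m * (1 - m); this guard is an added assumption.
    if s2 <= 0 or s2 >= m * (1 - m):
        raise ValueError("moments are not compatible with a Beta distribution")
    common = m * (1 - m) / s2 - 1
    return m * common, (1 - m) * common


# Hypothetical normalized time stamps of the words assigned to one topic.
topic_times = [0.10, 0.15, 0.20, 0.25, 0.30]
a, b = beta_shape_from_moments(topic_times)
```

By construction a / (a + b) equals the sample mean, so a quick sanity check is to compare this ratio with the mean of the input time stamps. In a Gibbs sampler these estimates would typically be refreshed after each sweep over the topic assignments, although the source does not state the update schedule.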