<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Evolution of Semantically Identified Topics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Mireles</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artem Revenko</string-name>
          <email>artem.revenkog@semantic-web.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Web Company</institution>
          ,
          <addr-line>Vienna</addr-line>
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Topics in a corpus evolve in time. Describing the way this evolution occurs helps us to understand the change in the prominence of concepts: one can gain intuition about which concepts become more important in a given topic, which substitute others, or which concepts become related. By defining topics as weighted collections of concepts from a fixed taxonomy, it is possible to know whether said evolution occurs within a branch of the taxonomy or whether hitherto unknown semantic relationships are formed or dissolved over time. In this work, we analyze a corpus of financial news and reveal the evolution of topics composed of concepts from the STW thesaurus. We show that using a thesaurus for building representations of documents is useful. Furthermore, the different abstraction levels encoded in a taxonomy are helpful for detecting different types of topics.</p>
      </abstract>
      <kwd-group>
        <kwd>topic discovery</kwd>
        <kwd>taxonomy</kwd>
        <kwd>thesaurus</kwd>
        <kwd>topic evolution</kwd>
        <kwd>topic modelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Of the many dimensions that can be ascribed to text corpora, the topics that they deal with is a very intuitive one for human readers. In a sense, topics constitute subgraphs of a "platonic knowledge graph", in which only certain concepts and certain relations exist. Two qualities of topics are of particular interest: they reflect the intentions and context of the author of a text, and they are often treated in different texts by different authors. For these reasons, understanding their evolution in time can serve as a proxy for studying the context of the authors.</p>
<p>In the case of corpora of news articles, understanding the evolution of topics can give insights into which entities become important in a given topic, or how they lose their importance. Furthermore, the detection of emerging and fading topics can be of interest for signalling major events.</p>
<p>In this work, we approach the study of topic evolution by using controlled, semantically enriched vocabularies. With our method, it is possible to describe the topics present at a certain time in terms of the topics present at previous time points. With this in hand, the method is able to recover stable topics that are consistently dealt with in the news. Furthermore, detection of important events that shift the composition of topics is also possible. Finally, we analyze the topic-identification power of different levels of abstraction, as defined by a thesaurus.</p>
    </sec>
    <sec id="sec-2">
      <title>Topic Discovery</title>
<p>
        Topic discovery, also known as topic modelling, is the task of analyzing a corpus and extracting from it clusters of terms that are semantically related. When approached with statistical tools, the semantic relationship of the discovered topics is deduced from their distributional properties: terms that co-occur more often are deemed to be semantically related. Usually topic discovery is approached by first representing documents in terms of bags of words, n-grams [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], or embedded representations [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], forming from such representations a document-term matrix and, finally, inferring topics from said matrix. This last step is performed by matrix decomposition methods such as SVD (a.k.a. LSA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) or NMF [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or by generative probabilistic models such as LDA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or PLSA [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The outcome of topic discovery is a collection of sets of terms, called topics, such that each document in the corpus can be assigned, often with a certain probability, to one of the topics.
      </p>
<p>
        In the scenario described above, the only semantic relations between terms that we have access to are those statistically discovered from the document-term matrix. However, in many applications further semantic relations between the terms are known. For example, if information about synonyms is available, topic detection can be done by counting occurrences of synsets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
<p>In this work, we aim at incorporating further semantic information into the topic discovery process by use of a thesaurus. A thesaurus is a controlled vocabulary whose concepts are organized according to their hypernym/hyponym relations. In effect, it is a multihierarchical directed graph, where a node represents a concept and edges represent hypernym or hyponym relations. Each concept is assigned one or more labels: strings that can be matched against a document.</p>
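A thesaurus of this kind can be sketched as a small in-memory graph; the concepts and labels below are invented for illustration:

```python
# A toy thesaurus: each concept maps to its broader (hypernym) concepts.
# Multihierarchy is allowed: a concept may have several broader concepts.
broader = {
    "bond": ["financial instrument"],
    "stock": ["financial instrument", "ownership"],
    "financial instrument": [],
    "ownership": [],
}
# Labels: strings that can be matched against a document.
labels = {"stock": ["stock", "share", "equity"], "bond": ["bond"]}

# Leaves are concepts that are not broader than any other concept.
all_broader = {b for bs in broader.values() for b in bs}
leaves = [c for c in broader if c not in all_broader]
```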
    </sec>
    <sec id="sec-3">
      <title>Topic Evolution</title>
<p>
        Studying topic evolution can be seen as a study of the history of ideas [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. By performing topic discovery on several corpora, each of which has a timestamp, it is possible to see the transitions in the interests of the author(s) of the corpora. This might be useful for discovering how a given topic is treated differently at different times, thus constituting a proxy for studying the evolution of semiotics. Several approaches have been adopted for the study of topic evolution. Some perform topic discovery independently in every corpus and only afterwards analyze the relationships between the topics (e.g. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), while others perform topic discovery in a corpus based on the topics discovered in the previous one (e.g. [
        <xref ref-type="bibr" rid="ref1 ref14 ref15">1, 15, 14</xref>
        ]). The former approach is subject to the variation inherent to topic discovery methods, which, in particular, can lead to topics being "lost" from one corpus to the next due to corpus quality or size. This can make the independently discovered topics difficult to compare. The latter approach has two main limitations: 1) new "flash topics" are hard to detect, and 2) the methods become over-sensitive to parameters, such as thresholds for estimating the number of topics. However, in preliminary experiments we have confirmed that dynamic topic models and plain NMF with subsequent estimation of transitions between different time points yield similar results.
      </p>
<p>In this work, we present an intermediate approach. Topics are discovered independently for each corpus, and corpora at successive times are analyzed to determine in which ways topics transitioned into others or appeared de novo. This second step allows us to describe the evolution of topics not just as the evolution of sets of co-occurring terms, but rather as the merging and splitting of existing topics. Hence, we are able to define the notion of a persistent topic, i.e. a topic that appears in several consecutive corpora.</p>
      <sec id="sec-3-1">
        <title>Data</title>
<p>
          The dataset we analyze is a financial news dataset. The news come from a single source (Bloomberg News) and are made available as supporting data in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. From the original dataset, we took articles between January 2009 and November 2013, totalling 447,145 documents. We consider the documents within each week, starting on Friday, a different corpus. Only a subset of the original dataset was used, in order to guarantee that all corpora have at least 50 documents. The sizes of the corpora range between 50 and 4900, with a mean corpus size of 2589 documents.
        </p>
<p>
          The semantic relationships between concepts that we use in this work are those expressed by the skos:narrower and skos:broader predicates in version 9.02 of the STW Thesaurus for Economics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This thesaurus consists of 6221 concepts, of which 4108 are leaves (i.e. they have no narrower concepts). The wide range of concepts included in the thesaurus makes it ideal for analyzing the corpus of financial news, especially because of its concept schemes of Geographic Names and General Descriptors. For the purposes of this work, the predicate skos:topConceptOf is considered to be equivalent to skos:broader.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Methods</title>
<p>Entity extraction Entity extraction was performed using the PoolParty Semantic Suite (poolparty.biz). In brief, the texts are pre-processed in the same manner as the labels from the thesaurus: stopwords are removed, tokens are lemmatized, and n-grams of up to 4 tokens are constituted. Then, matching is performed to identify all the concepts appearing in the documents. Thus, for every concept c in the thesaurus and every document d, we have computed nd(c), the number of times any label of c appears in document d. Finally, all the documents corresponding to the same week were put together into a single corpus, which we denote by Cw, where w is the week number.</p>
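The counting step can be sketched as follows; the concept labels are invented, and simple lowercased substring matching stands in for the lemmatized n-gram matching described above:

```python
# Count n_d(c): occurrences of any label of concept c in document d.
# Real matching is lemmatized and n-gram based; lowercasing stands in here.
labels = {"stock": ["stock", "share"], "interest rate": ["interest rate"]}

def concept_counts(document):
    text = document.lower()
    return {c: sum(text.count(lab) for lab in labs)
            for c, labs in labels.items()}

n_d = concept_counts(
    "The interest rate rose and the stock market followed; share prices fell."
)
```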
        <p>Representing documents For each document d in each corpus, two vector
representations are computed.</p>
<p>The first, which we call the level 0 representation, contains information only about the concepts that have no narrower concept in the thesaurus. We call this set the set of leaves and denote its elements by l1, l2, ..., lm0. The level 0 representation of a document d is then an m0-dimensional vector V0(d) whose i'th entry is given by V0(d)[i] = nd(li)/nd, where nd is the number of tokens in document d. The second representation, called the level 1 representation, contains information only about those concepts that are broader concepts of some leaf. If we denote that set by b1, b2, ..., bm1, then this representation is an m1-dimensional vector whose i'th entry is given by</p>
<p>V1(d)[i] = Σ_{l ∈ L(bi)} nd(l)/nd, (1)</p>
        <p>
          where L(c) denotes the set of nodes which are narrower than concept c. Note that both of these representations consider only the occurrences of leaf concepts; the difference is that in the level 1 representation, the occurrences of concepts that are narrower than one same concept are grouped together. With these two representations defined, we can represent each corpus Cw by two matrices, A0(w) and A1(w), both of which have as many columns as there are documents in corpus Cw, but with m0 and m1 rows respectively.
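Equation (1) and the level 0 definition can be sketched directly from the counts nd(c); the counts and the two-level hierarchy below are made up for illustration:

```python
# Level 0: leaf frequencies; level 1: frequencies aggregated per broader concept.
# Toy counts and hierarchy, chosen only to illustrate equation (1).
n_d = {"stock": 2, "bond": 1, "inflation": 1}   # leaf counts in document d
n_tokens = 20                                    # nd: tokens in document d
leaves = ["stock", "bond", "inflation"]
narrower = {"financial instrument": ["stock", "bond"],
            "prices": ["inflation"]}             # L(b): leaves below b

V0 = [n_d[l] / n_tokens for l in leaves]
V1 = [sum(n_d[l] for l in narrower[b]) / n_tokens
      for b in ["financial instrument", "prices"]]
```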
</p>
        <p>
          Detecting topics in a week For each week w, we compute a Non-Negative Matrix Factorization (NMF) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] of the two matrices A0(w) and A1(w). NMF decomposes a matrix A ∈ R+^(m×n) into the product T S of two matrices, with T ∈ R+^(m×k) and S ∈ R+^(k×n). To choose the value of k, we first compute σ1 ≥ σ2 ≥ ... ≥ σn, all singular values of A in descending order, and then choose the k that maximizes the gap σk − σk+1. This method is equivalent to the eigenvalue gap trick [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in the case of non-negative matrices. The NMF decompositions were computed using the scikit-learn library [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], setting the parameter l1_ratio to 1. We followed NMF by a sparsifying step: the smallest threshold τ was found (using gradient descent) such that if T̃ is the matrix resulting from setting to 0 all entries of T whose value is lower than τ, then the density of T̃ S is not more than one half the original density of A. From now on, we refer to T̃ simply as T. The use of NMF yields, for each week w, two pairs of matrices: T0(w), S0(w) and T1(w), S1(w). We call the matrices T resulting from NMF the concept-topic matrices. If T[c, j] &gt; 0, we say that concept c belongs to topic j with a degree of T[c, j]. Thus, for every week w we can compute two sets of topics, one for each representation.
        </p>
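The choice of k and the factorization step can be sketched as follows; the random low-rank matrix is an illustrative stand-in for A0(w) or A1(w), and the sparsification step described above is omitted:

```python
# Choose k by the largest gap between consecutive singular values,
# then factorize with NMF. The matrix here is toy low-rank data plus noise.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((30, 2)) @ rng.random((2, 40)) + 0.01 * rng.random((30, 40))

sigma = np.linalg.svd(A, compute_uv=False)        # descending order
k = int(np.argmax(sigma[:-1] - sigma[1:]) + 1)    # gap sigma_k - sigma_{k+1}

model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
T = model.fit_transform(A)                         # concept-topic matrix
S = model.components_                              # topic-document matrix
```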
<p>Detection of Topic Transitions For a representation q ∈ {0, 1} and two consecutive weeks, the previous steps yield the matrices Tq(w) ∈ R+^(mq×k1) and Tq(w+1) ∈ R+^(mq×k2), where k1 and k2 are the numbers of topics in the respective weeks. In order to detect transitions between topics in these two consecutive weeks, we solve the optimization problem of finding the matrix Mq(w) ∈ [0, 1]^(k1×k2) that minimizes ||Tq(w)Mq(w) − Tq(w + 1)||2. The resulting matrix Mq(w), which we call the transition matrix, expresses each topic in week w + 1 as a linear combination of the topics in week w. The following points help in the interpretation of transition matrices:
- If two topics merge into a new topic in the next week, then the new topic's column will have large entries in the rows corresponding to the two previous topics.
- A new topic in week w + 1 will correspond to a column whose entries are all small: it is not similar to any topic in the previous week.
- A topic whose concepts don't change between two consecutive weeks can be detected by a column and a row that both have a single entry close to 1.
- One can think of a topic transition as the process of a topic from the previous week distributing its weight among the topics of the current week. The topic cannot give more weight than it has.</p>
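The optimization can be sketched with a column-wise non-negative least squares solve followed by clipping to [0, 1]; the solver choice and the toy matrices are assumptions, as the text does not specify how the problem is solved:

```python
# Estimate M minimizing ||T_w @ M - T_w1||_2 column by column with NNLS,
# then clip the entries to [0, 1]. Toy matrices for illustration.
import numpy as np
from scipy.optimize import nnls

T_w = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 1.0]])           # 3 concepts, 2 topics in week w
T_w1 = np.array([[0.9, 0.1],
                 [0.1, 0.8],
                 [0.0, 0.9]])          # 2 topics in week w + 1

M = np.column_stack([nnls(T_w, T_w1[:, j])[0] for j in range(T_w1.shape[1])])
M = np.clip(M, 0.0, 1.0)               # transition matrix, entries in [0, 1]
```

The near-diagonal result shows two topics that each persist almost unchanged, matching the interpretation points above.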
<p>With a set of consecutive transition matrices Mq(w), Mq(w + 1), ..., Mq(w + g), it is possible to detect topics which remain stable for several weeks: they form a sequence of indices t1, t2, ..., tg such that Mq(w + i)[ti, ti+1] ≥ 1 − ε for i = 1, ..., g − 1 and some small ε. We consider a topic to be a stable topic if the above condition holds for g ≥ 5 with ε = 0.2, i.e. if the topic keeps approximately the same concept composition for at least 4 weeks. Figure 1 shows an example of a set of transition matrices that exhibit a stable topic.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Results</title>
<p>We decomposed all corpora into topics, leading to a different number of topics per week (Fig 2). After computing the corresponding transition matrices, we are able to detect the appearance of new topics as well as several stable topics. Among them, we can distinguish persistent and flash topics.</p>
<p>[Fig. 2: Number of topics (total and new) per week, for representation 0 and representation 1.]</p>
<p>Persistent topics are topics that the news source treats regularly. While interruptions in the data (weeks with few articles) sometimes fragmented these topics, in the sense of our definition of stable topics, they were quick to recover. Several stable topics, represented through their most important concepts, are presented in Table 1. Furthermore, we were able to detect flash topics, that is, topics which relate to specific, transient events. We found the following flash topics in the level 0 representation:
1. the 2010 Football World Cup in South Africa,
2. the 2010 artillery fire exchange in the Korean peninsula,
3. the 2011 drought,
4. the Arab Spring,
5. the Fukushima Daiichi nuclear disaster.
The most frequently mentioned concepts in these topics can be seen in Table 2. The evolution of topics can best be exemplified by the Football World Cup example. In Table 3, the change in the topic across weeks is shown. Notice how the number of countries decreases, and those which remain are also those which remained in the tournament. Let us recall that the tournament ended on the 11th of July.</p>
<p>Many of these topics were also found in the level 1 representation. In general, the level 1 representation yields longer stable topics, as can be seen in the length distributions of stable topics in Figure 3. It is worth mentioning that a topic which is very fragmented (in time) in the level 0 representation, namely that of the Japanese market, is less so in the level 1 representation. This suggests that looking at more abstract concepts can increase the stability of topic detection. Importantly, while using the broader concept of a set of leaves increases stability, there is no evidence that similar topics are confounded by this process. This is an indication that topics do not necessarily match branches of the taxonomy but, rather, are combinations of concepts from across the thesaurus. For the same reason, topics in consecutive weeks which are not deemed persistent under the level 0 representation become so in the level 1 representation. It is thus important to carefully choose the level of granularity of the concepts used to annotate a corpus with the aim of persistent topic detection. It must also be noted that detecting topics based solely on words (i.e. without a controlled vocabulary) does not provide this possibility.</p>
<p>Finally, preliminary results show that, on average, concepts gained by a topic during its evolution are closer in the thesaurus to the concepts already in the topic than is expected at random.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Conclusions and Future Work</title>
        <p>We have presented a method that is able to detect both persistent and
transient topics in news sources. Interestingly, we have shown that it is possible to
detect such topics both when annotating documents only with leaves from a
thesaurus, and when annotating them also with those concepts directly above
leaves. Furthermore, we have shown that the relatively simple NMF method
is able to detect stable topics, and that this stability can also be captured by
our proposed method for computing topic transitions. By considering the news
articles in each week as independent corpora, we are able to detect short-lived
topics that would otherwise be lost in a global topic model.</p>
        <p>The fact that topics are detectable in both representations is an indication that
topics are not easily confounded with each other when one considers more
abstract categories. This is an interesting result, for it shows that the concepts
comprising a topic are distributed widely enough across the thesaurus that
abstracting them just one level still allows for their detection. In a sense, the way that the concepts are organized in the thesaurus does not match the real-world topics. We believe that this result can serve as a measure of the generality of a thesaurus, and of its applicability to analyzing texts from various sources.</p>
        <p>[Fig. 3: Distributions of stable topic durations for the level 0 and level 1 representations.]</p>
<p>This work represents initial results in the analysis of topic transitions with the help of a thesaurus. In future work we intend to build on top of the current results and extend the methodology. In particular, the current results motivate us to:
1. Investigate and compare the representations of topics at different levels. Preliminary observations suggest that several stable topics at level 0, in non-overlapping weeks, could be merged into a single topic at level 1.
2. Exploit the greater stability of topics at level 1. This could be useful in the case of limited data, when a detailed topic could fade away.
3. Explore how the different levels of representation could prove useful for people with different backgrounds. Namely, we expect that more detailed topics could be of interest for experts in the field, whereas the more general topics could give a good overview for less experienced readers.
4. Improve the transition computation by taking the distances between concepts into account and employing methods similar to soft cosine similarity.
5. Perform the statistical analysis required to confirm our observations that the concepts gained during topic evolution are more likely to be close, in the thesaurus, to concepts already in a topic.</p>
<p>Acknowledgements This work is partially supported by the PROFIT project (http://projectprofit.eu/), part of the European Commission's H2020 Framework Programme, grant agreement no. 687895.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Dynamic topic models</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on Machine learning</source>
          . pp.
          <volume>113</volume>
–
          <fpage>120</fpage>
          .
ACM
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research 3(Jan)</source>
          ,
          <volume>993</volume>
–
          <fpage>1022</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Borst</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Case study: Publishing stw thesaurus for economics as linked open data</article-title>
          .
          <source>W3C Semantic Web Use Cases and Case Studies</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Using structured events to predict stock price movement: An empirical investigation</article-title>
          .
          <source>In: EMNLP</source>
          . pp.
          <volume>1415</volume>
–
          <issue>1425</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Djurdjevac</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Schutte, C.:
          <article-title>Estimating the eigenvalue error of markov state models</article-title>
          .
          <source>Multiscale Modeling &amp; Simulation</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          ),
          <volume>61</volume>
–
          <fpage>81</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferrugento</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>H.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigues</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Towards the improvement of a topic model with semantic knowledge</article-title>
          .
<source>In: Portuguese Conference on Artificial Intelligence</source>
          . pp.
          <volume>759</volume>
–
          <fpage>770</fpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Studying the history of ideas using topic models</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing</source>
          . pp.
          <volume>363</volume>
–
          <fpage>371</fpage>
          .
Association for Computational Linguistics (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Unsupervised learning by probabilistic latent semantic analysis</article-title>
          .
          <source>Machine learning 42(1)</source>
          ,
          <volume>177</volume>
–
          <fpage>196</fpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laham</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>P.W.</given-names>
          </string-name>
          :
          <article-title>Learning human-like knowledge by singular value decomposition: A progress report</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>45</volume>
–
          <issue>51</issue>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seung</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          :
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>
          .
          <source>Nature</source>
          <volume>401</volume>
          (
          <issue>6755</issue>
          ),
          <volume>788</volume>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Billingsley</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Johnson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Improving topic models with latent feature word representations</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>3</volume>
          ,
          <issue>299</issue>
–
          <fpage>313</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Recht</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ré</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tropp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bittorf</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Factoring nonnegative matrices with linear programs</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>1214</fpage>
          -
          <lpage>1222</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sindhwani</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Learning evolving and emerging topics in social media: A dynamic NMF approach with temporal regularization</article-title>
          .
          <source>In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining</source>
          . pp.
          <fpage>693</fpage>
          -
          <lpage>702</lpage>
          . WSDM '12,
          <publisher-name>ACM</publisher-name>
          , New York, NY, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vaca</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mantrach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaimes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saerens</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A time-based collective factorization for topic discovery and monitoring in news</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on World Wide Web</source>
          . pp.
          <fpage>527</fpage>
          -
          <lpage>538</lpage>
          . WWW '14,
          <publisher-name>ACM</publisher-name>
          , New York, NY, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Topical n-grams: Phrase and topic discovery, with an application to information retrieval</article-title>
          .
          <source>In: Seventh IEEE International Conference on Data Mining (ICDM 2007)</source>
          . pp.
          <fpage>697</fpage>
          -
          <lpage>702</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>