=Paper=
{{Paper
|id=Vol-1923/article-06
|storemode=property
|title=Evolution of Semantically Identified Topics
|pdfUrl=https://ceur-ws.org/Vol-1923/article-06.pdf
|volume=Vol-1923
}}
==Evolution of Semantically Identified Topics==
Evolution of Semantically Identified Topics
Victor Mireles and Artem Revenko
Semantic Web Company, Vienna, Austria
{victor.mireles-chavez,artem.revenko}@semantic-web.com
Abstract. Topics in a corpus evolve in time. Describing the way this evolution occurs helps us to understand the change in the prominence of concepts: one can gain intuition about which concepts become more important in a given topic, which ones substitute others, or which concepts become related. By defining topics as weighted collections of concepts from a fixed taxonomy, it is possible to know whether said evolution occurs within a branch of the taxonomy or whether hitherto unknown semantic relationships are formed or dissolved with time. In this work, we analyze a corpus of financial news and reveal the evolution of topics composed of concepts from the STW thesaurus. We show that using a thesaurus for building representations of documents is useful. Furthermore, the different abstraction levels encoded in a taxonomy are helpful for detecting different types of topics.
Keywords: topic discovery, taxonomy, thesaurus, topic evolution, topic
modelling
1 Introduction
Of the many dimensions that can be ascribed to text corpora, the topics that they deal with are among the most intuitive for human readers. In a sense, topics constitute
subgraphs of a “platonic knowledge graph”, where only certain concepts and
certain relations exist. Two qualities of topics are of particular interest: they
reflect the intentions and context of the author of a text, and they are often
treated in different texts by different authors. For these reasons, understanding
their evolution in time can be treated as a proxy for studying the context of the
authors.
In the case of corpora of news articles, understanding the evolution of topics
can give insights into which entities become important in a given topic, or how
they lose their importance. Furthermore, the detection of emerging and fading
topics can be of interest for signalling major events.
In this work, we approach the study of topic evolution by using controlled,
semantically enriched vocabularies. With our method, it is possible to describe
the topics present at a certain time in terms of the topics present at previous time points. With this in hand, the method is able to recover stable topics that are consistently dealt with in the news. Furthermore, it is also possible to detect important events that shift the composition of topics. Finally, we perform
an analysis of the topic-identification power of different levels of abstraction, as
defined by a thesaurus.
1.1 Topic Discovery
Topic discovery, also known as topic modelling, is the task of analyzing a corpus
and extracting from it clusters of terms that are semantically related.
When approached with statistical tools, the semantic relationships within the discovered topics are deduced from distributional properties: terms that co-occur more often are deemed to be semantically related. Usually, topic discovery is approached by first representing documents in terms of bags of words, n-grams[16] or embedded representations[11], forming from such representations a document-
term matrix and, finally, inferring topics from said matrix. This last step is
performed by matrix decomposition methods such as SVD (a.k.a. LSA[9]) or
NMF[13], or by generative probabilistic models such as LDA[2] or PLSA[8]. The
outcome of topic discovery is a collection of sets of terms, called topics, such that
each document in the corpus can be assigned, often with a certain probability,
to one of the topics.
In the scenario described above, the only semantic relations between the terms that we have access to are those discovered statistically from the document-term matrix. However, in many applications further semantic relations between the terms are known. For example, if information about synonyms is available, topic detection can be done by counting occurrences of synsets[6].
In this work, we aim at incorporating further semantic information into the topic
discovery process by use of a thesaurus. A thesaurus is a controlled vocabulary
whose concepts are organized according to their hypernym/hyponym relations.
In effect, it is a multihierarchical directed graph, where a node represents a
concept and edges represent hypernym or hyponym relations. Each concept is
assigned one or more labels: strings that can be matched against a document.
1.2 Topic Evolution
Studying topic evolution can be seen as a study of the history of ideas[7]. By performing topic discovery on several corpora, each of which has a timestamp, it is possible to see transitions in the interests of the author(s) of the corpora. This might be useful for discovering how a given topic is treated differently at different times, thus constituting a proxy for studying the evolution of semiotics. Several approaches have been adopted for the study of topic evolution. Some perform topic discovery independently in every corpus and only afterwards analyze the relationships between the topics (e.g. [7]), while others perform topic discovery in a corpus based on the topics discovered in the previous one (e.g. [1, 15, 14]). The former approach is subject to the variation inherent in topic discovery methods, which, in particular, can lead to topics being "lost" from one corpus to the next due to corpus quality or size. This can make the independently discovered topics difficult to compare. The latter approach has two main limitations: 1) new "flash topics" are hard to detect, and 2) the methods become over-sensitive to parameters, such as thresholds for estimating the number of topics. However, in preliminary experiments we have confirmed that dynamic topic models and plain NMF with subsequent estimation of transitions between different time points yield similar results.
In this work, we present an intermediate approach. Topics are discovered independently for each corpus, and corpora at successive times are analyzed to determine in which ways topics transitioned into others or appeared de novo. This second step allows us to describe the evolution of topics not just as the evolution of sets of co-occurring terms, but rather as the merging and splitting of existing topics. Hence, we can define the notion of a persistent topic, i.e. a topic that appears in several consecutive corpora.
2 Data
We analyze a dataset of financial news. The news come from a single source (Bloomberg News) and are made available as supporting data in [4]. From the original dataset, we took articles between January 2009 and November 2013, totalling 447,145 documents. We consider the documents within each week, starting on Friday, a different corpus. Only a subset of the original dataset was used, in order to guarantee that all corpora have at least 50 documents. The sizes of the corpora range between 50 and 4900 documents, with a mean corpus size of 2589 documents.
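This weekly slicing is straightforward to reproduce; the following is a minimal sketch using pandas, where the DataFrame `df` and its columns `date` and `text` are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# A minimal sketch of the weekly corpus slicing, assuming a DataFrame `df`
# with columns `date` (datetime) and `text`; these names are illustrative.
# Weeks start on Friday, so we bucket by weekly periods ending on Thursday.
df["week"] = df["date"].dt.to_period("W-THU")

# Keep only weeks with at least 50 documents, as described above.
corpora = {week: group["text"].tolist()
           for week, group in df.groupby("week")
           if len(group) >= 50}
```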
The semantic relationships between concepts that we use in this work are those expressed by the skos:narrower and skos:broader predicates in version 9.02 of the STW Thesaurus for Economics[3]. This thesaurus consists of 6221 concepts, of which 4108 are leaves (i.e. they have no narrower concepts). The wide range of concepts included in the thesaurus makes it ideal for analyzing a corpus of financial news, especially because of its concept schemes of Geographic Names and General Descriptors. For the purposes of this work, the predicate skos:topConceptOf is considered to be equivalent to skos:broader.
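As an illustration of how the relevant concept sets can be read from the thesaurus, the following rdflib sketch extracts the leaves and their broader concepts; the file name is an assumption, since STW is distributed in several RDF serializations.

```python
from rdflib import Graph, RDF
from rdflib.namespace import SKOS

# Load the STW Thesaurus for Economics (the file name is illustrative).
g = Graph()
g.parse("stw.ttl", format="turtle")

concepts = set(g.subjects(RDF.type, SKOS.Concept))

# Leaves are concepts with no skos:narrower successor.
leaves = {c for c in concepts if (c, SKOS.narrower, None) not in g}

# The level 1 vocabulary consists of the broader concepts of leaves;
# skos:topConceptOf is treated like skos:broader, as stated above.
level1 = set()
for leaf in leaves:
    level1.update(g.objects(leaf, SKOS.broader))
    level1.update(g.objects(leaf, SKOS.topConceptOf))
```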
3 Methods
Entity extraction Entity extraction was performed using the PoolParty Semantic Suite (poolparty.biz). In brief, the texts are pre-processed in the same manner as the labels from the thesaurus: stopwords are removed, tokens are lemmatized, and n-grams of up to four tokens are formed. Then, matching is performed to identify all the concepts appearing in the documents. Thus, for every concept $c$ in the thesaurus and every document $d$, we have computed $n_d(c)$, the number of times any label of $c$ appears in document $d$. Finally, all the documents corresponding to the same week were put together into a single corpus, which we denote by $C_w$, where $w$ is the week number.
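The matching step can be pictured with the following simplified stand-in for the PoolParty extractor; the dictionary `label_to_concept`, mapping preprocessed labels to concepts, is an assumed input.

```python
from collections import Counter

def concept_counts(tokens, label_to_concept, max_n=4):
    """Count concept occurrences n_d(c) in one preprocessed document.

    `tokens` is the list of stopword-free, lemmatized tokens of the
    document, and `label_to_concept` maps a label, stored as a tuple of
    preprocessed tokens, to its concept.  This is a simplified stand-in
    for the PoolParty extractor used in the paper.
    """
    counts = Counter()
    for n in range(1, max_n + 1):              # 1-grams up to 4-grams
        for i in range(len(tokens) - n + 1):
            concept = label_to_concept.get(tuple(tokens[i:i + n]))
            if concept is not None:
                counts[concept] += 1
    return counts
```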
Representing documents For each document $d$ in each corpus, two vector representations are computed.
The first, which we call the level 0 representation, contains information only about the concepts that have no narrower concept in the thesaurus. We call this set the set of leaves and denote it by $l_1, l_2, \ldots, l_{m_0}$. The level 0 representation of a document $d$ is then an $m_0$-dimensional vector $V_0(d)$ whose $i$'th entry is given by $V_0(d)[i] = n_d(l_i) / n_d$, where $n_d$ is the number of tokens in document $d$.
The second representation, called the level 1 representation, contains information only about those concepts that are broader concepts of some leaf. If we denote that set by $b_1, b_2, \ldots, b_{m_1}$, then this representation consists of an $m_1$-dimensional vector whose $i$'th entry is given by

$$V_1(d)[i] = \sum_{l \in L(b_i)} \frac{n_d(l)}{n_d} \qquad (1)$$

where $L(c)$ denotes the set of concepts which are narrower than concept $c$.
It must be noted that both of these representations consider only the occurrences of leaf concepts; the difference is that in the level 1 representation the occurrences of leaves that are narrower than one same concept are grouped together. With these two representations defined, we can represent each corpus $C_w$ by two matrices, $A_0(w)$ and $A_1(w)$, both of which have as many columns as there are documents in corpus $C_w$, but with $m_0$ and $m_1$ rows respectively.
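Under these definitions, the two corpus matrices can be assembled as in the following sketch; `docs`, `leaves` and `broader_of` are assumed to be the outputs of the previous steps.

```python
import numpy as np

def corpus_matrices(docs, leaves, broader_of):
    """Build A0(w) (leaves x documents) and A1(w) (broader-of-leaf x documents).

    `docs` is a list of (concept_counts, n_tokens) pairs for one week,
    `leaves` the list l_1..l_m0 of leaf concepts, and `broader_of` a dict
    mapping each leaf to its broader concepts.
    """
    leaf_index = {l: i for i, l in enumerate(leaves)}
    broaders = sorted({b for l in leaves for b in broader_of.get(l, ())})
    broader_index = {b: i for i, b in enumerate(broaders)}

    A0 = np.zeros((len(leaves), len(docs)))
    A1 = np.zeros((len(broaders), len(docs)))
    for j, (counts, n_tokens) in enumerate(docs):
        for leaf, n in counts.items():
            if leaf not in leaf_index:
                continue                           # only leaf occurrences count
            A0[leaf_index[leaf], j] = n / n_tokens           # V0(d)[i]
            for b in broader_of.get(leaf, ()):               # V1 groups the leaves
                A1[broader_index[b], j] += n / n_tokens      # below each broader
    return A0, A1
```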
Detecting topics in a week For each week $w$, we compute a Non-negative Matrix Factorization (NMF)[10] of the two matrices $A_0(w)$ and $A_1(w)$. NMF decomposes a matrix $A \in \mathbb{R}_+^{m \times n}$ into the product $TS$ of two matrices, with $T \in \mathbb{R}_+^{m \times k}$ and $S \in \mathbb{R}_+^{k \times n}$. To choose the value of $k$, we first compute $\sigma_1, \sigma_2, \ldots, \sigma_n$, all singular values of $A$ in descending order, and then choose the $k$ that maximizes $\frac{\sigma_k - \sigma_{k+1}}{\sigma_{k+1} - \sigma_{k+2}}$. This method is equivalent to the eigenvalue gap trick [5] in the case of non-negative matrices. The NMF decompositions were computed using the scikit-learn library[12], setting the parameter l1_ratio to 1. We followed NMF by a sparsifying step: the smallest threshold $\theta$ was found (using gradient descent) such that, if $T_\theta$ is the result of setting to 0 all entries of $T$ whose value is lower than $\theta$, then the density of $T_\theta S$ is not more than one half the original density of $A$. From now on, we refer to $T_\theta$ simply as $T$.
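The following sketch reproduces this step with scikit-learn. The function choose_k implements the singular-value gap ratio above; for the sparsifying step, where the paper uses gradient descent to find $\theta$, the sketch instead scans the entries of $T$, which reaches the same smallest threshold for this monotone criterion.

```python
import numpy as np
from sklearn.decomposition import NMF

def choose_k(A):
    """Pick the k maximizing (sigma_k - sigma_{k+1}) / (sigma_{k+1} - sigma_{k+2})."""
    s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
    ratios = (s[:-2] - s[1:-1]) / (s[1:-1] - s[2:] + 1e-12)  # epsilon guards ties
    return int(np.argmax(ratios)) + 1             # ratios[i] corresponds to k = i + 1

def sparsify(T, S, A, ratio=0.5):
    """Smallest theta such that density(T_theta S) <= ratio * density(A)."""
    target = ratio * np.count_nonzero(A) / A.size
    for theta in np.sort(np.unique(T)):           # ascending candidate thresholds
        T_theta = np.where(T < theta, 0.0, T)
        if np.count_nonzero(T_theta @ S) / A.size <= target:
            return T_theta
    return np.zeros_like(T)

def weekly_topics(A):
    """Decompose one weekly matrix into concept-topic T and topic-document S."""
    model = NMF(n_components=choose_k(A), l1_ratio=1.0, init="nndsvd", max_iter=500)
    T = model.fit_transform(A)                    # shape: (concepts, k)
    S = model.components_                         # shape: (k, documents)
    return sparsify(T, S, A), S
```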
The use of NMF yields, for each week $w$, two pairs of matrices: $T_0(w), S_0(w)$ and $T_1(w), S_1(w)$. We call the matrices $T$ resulting from NMF the concept-topic matrices. If $T[c, j] > 0$, we say that concept $c$ belongs to topic $j$ with degree $T[c, j]$. Thus, for every week $w$ we can compute two sets of topics, one for each representation.
Detection of Topic Transitions For a representation $q \in \{0, 1\}$ and two consecutive weeks, the previous steps yield the matrices $T_q(w) \in \mathbb{R}_+^{m_q \times k_1}$ and $T_q(w+1) \in \mathbb{R}_+^{m_q \times k_2}$, where $k_1$ and $k_2$ are the numbers of topics in the two weeks. In order to detect transitions between topics in these two consecutive weeks, we solve the optimization problem of finding the matrix $M_q(w) \in [0, 1]^{k_1 \times k_2}$ that minimizes $\|T_q(w) M_q(w) - T_q(w+1)\|_2$. The resulting matrix $M_q(w)$, which we call the transition matrix, expresses each topic in week $w+1$ as a linear combination of the topics in week $w$.
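Since the squared error separates over the columns of $M_q(w)$, the transition matrix can be computed one topic at a time as a bounded least-squares problem; a sketch using SciPy:

```python
import numpy as np
from scipy.optimize import lsq_linear

def transition_matrix(T_prev, T_next):
    """Minimize ||T_prev M - T_next||^2 with the entries of M confined to [0, 1].

    Column j of M expresses topic j of week w+1 as a combination of the
    topics of week w, so the columns can be solved independently.
    """
    k1, k2 = T_prev.shape[1], T_next.shape[1]
    M = np.zeros((k1, k2))
    for j in range(k2):
        result = lsq_linear(T_prev, T_next[:, j], bounds=(0.0, 1.0))
        M[:, j] = result.x
    return M
```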
Fig. 1. Transition matrices between 6 consecutive weeks. These are the weeks in which the Football World Cup flash topic is detected. The topic is not present in the first week, and its index within each of the following weeks is 6, 10, 16, 26 and 2 (indicated with red dots).
The following points help in the interpretation of transition matrices:
– If two topics merge into a new topic in the next week, then the new topic's column will have large entries in the rows corresponding to the two previous topics.
– A new topic in week $w+1$ will correspond to a column whose entries are all small: it is not similar to any topic in the previous week.
– A topic whose concepts don't change between two consecutive weeks can be detected by a column and a row that both have a single entry close to 1.
– One can think of a topic transition as the process of a topic from the previous week distributing its weight among the topics of the current week. The topic cannot give more weight than it has.
With a set of consecutive transition matrices $M_q(w), M_q(w+1), \ldots, M_q(w+g)$, it is possible to detect topics which remain stable for several weeks: they appear as a sequence of indices $t_1, t_2, \ldots, t_g$ such that $1 - M_q(w+i-1)[t_i, t_{i+1}] \leq \alpha$ for $i = 1, \ldots, g-1$ and some small $\alpha$. We consider a topic to be a stable topic if the above condition holds for $g \geq 5$ with $\alpha = 0.2$, i.e. if the topic keeps approximately the same concept composition for at least 4 weeks. Figure 1 is an example of a set of transition matrices that exhibit a stable topic.
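A possible way to trace such stable topics through a window of transition matrices is sketched below; for simplicity, it follows only the strongest transition of each topic and starts chains only in the first week of the window.

```python
import numpy as np

def stable_topics(transitions, alpha=0.2, min_weeks=5):
    """Find index chains t_1, t_2, ... with 1 - M[t_i, t_{i+1}] <= alpha.

    `transitions` is the list M_q(w), M_q(w+1), ... for consecutive weeks.
    Chains shorter than `min_weeks` are discarded.
    """
    chains = []
    for start in range(transitions[0].shape[0]):
        chain = [start]
        for M in transitions:
            current = chain[-1]
            candidate = int(np.argmax(M[current]))     # strongest successor topic
            if 1.0 - M[current, candidate] > alpha:
                break                                  # composition changed too much
            chain.append(candidate)
        if len(chain) >= min_weeks:
            chains.append(chain)
    return chains
```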
4 Results
We decomposed all corpora into topics, leading to a different number of topics per week (Fig. 2). After computing the corresponding transition matrices, we are able to detect the appearance of new topics as well as several stable topics. Among them, we can distinguish persistent and flash topics.
[Figure 2: two panels, Representation 0 and Representation 1, plotting the number of topics (total and new) against week number.]
Fig. 2. Number of topics discovered for every week. The dips are due mostly to weeks in which few documents were collected. A topic is considered new if it belongs to no earlier stable topic.
Persistent topics are topics that the news source treats regularly. While interruptions in the data (weeks with few articles) sometimes fragmented these topics, in the sense of our definition of stable topics, they were quick to recover. Several stable topics, represented through their most important concepts, are presented in Table 1.
Table 1. Twelve most important concepts in persistent topics. Concepts are sorted in decreasing order of average (per week) importance. Interestingly, Bayesian inference appears as part of a topic because Prior is a skos:altLabel for this concept in the thesaurus.
Stocks | Futures | Japanese | Euro | Meat
Market | Future | Yen | Germans | Beef
Stock market | Soybean | Tokyo | German | Light
Product | Crops | Japanese | European | Bayesian inference
International | Gold | Loss | Greek | Cattle
Purchase | Wheat | Electronics | Greeks | Price
Market value | Department | Newspaper | Greece | Department
Price | Sugar | Sales | Berlin | Plants
Swap | Rubber | Dividend | Nation | Flavour
Benchmarking | Singapore | Services | Bailout | Sales
Hedging | Cocoa | Product | Economy | Product
Loss | Cattle | Semiconductor | Portuguese | Import
Future | Palms | Plants | London | Tokyo
Furthermore, we were able to detect flash topics, that is, topics which relate to specific, transient events. We found the following flash topics in the level 0 representation:
1. the 2010 South Africa World Football Cup,
2. the 2010 artillery fire exchange in the Korean peninsula,
3. the 2011 drought,
4. the Arab Spring,
5. the Fukushima Daiichi nuclear disaster.
The most frequently mentioned concepts in these topics can be seen in Table 2. The evolution of topics can best be exemplified by the Football World Cup example. Table 3 shows the change in the topic across weeks. Notice how the number of countries decreases, and those which remain are also those which remained in the tournament. Let us recall that the tournament ended on the 11th of July.
Many of these topics were also found in the level 1 representation. In general, the level 1 representation yields longer stable topics, as can be seen in the length distributions of stable topics in Figure 3. It is worth mentioning that a topic which is very fragmented (in time) in the level 0 representation, that of the Japanese market, is less so in the level 1 representation. This suggests that looking at more abstract concepts can increase the stability of topic detection. It is important to note that, while using the broader concepts of a set of leaves increases stability, there is no evidence that similar topics are confounded by this process. This is an indication that topics do not necessarily match branches of the taxonomy but, rather, are combinations of concepts from across the thesaurus.
Table 2. Twelve most important concepts in flash topics. Concepts are sorted in decreasing order of average (per week) importance.
Football (6 weeks) | Korea (5 weeks) | Drought (5 weeks) | Arab Spring (8 weeks) | Fukushima (6 weeks)
World | Koreans | Wheat | Libya | Plants
Sport event | Korean | Crops | International | Nuclear energy
South Africa | South Korea | Drought | Nation | Manufacturing plant
Football | North Korea | Soybean | Foreign | Electricity
African | South Korean | World | Arabs | Cooling
Matching | South Koreans | Rice | Arab | Earthquake
Coaching | Officials | Food price | Egypt | Greenhouse gas emis.
South African | Nation | Nation | Air | Nuclear power plant
South Africans | Island | Egypt | West Asia | Nuclear fuel
Brazil | Foreign | Department | Tunisia | Order
Argentina | World | Australia | Officials | Process
Netherlands | Chinese | International | African | Officials
Mexico | Fire | Province | Industrial action | Fire
For the same reason, topics in consecutive weeks which are not deemed persistent under the level 0 representation become so in the level 1 representation. It is thus important to carefully choose the level of granularity of the concepts that will be used to annotate a corpus with the aim of persistent topic detection. It must also be noted that detecting topics based solely on the words (i.e. without a controlled vocabulary) does not provide this possibility.
Finally, preliminary results show that, on average, concepts gained by a topic during its evolution are closer in the thesaurus to the concepts already in the topic than is expected at random.
5 Conclusions and Future Work
We have presented a method that is able to detect both persistent and transient topics in news sources. Interestingly, we have shown that it is possible to detect such topics both when annotating documents only with leaves from a thesaurus, and when annotating them also with those concepts directly above the leaves. Furthermore, we have shown that the relatively simple NMF method is able to detect stable topics, and that this stability can also be captured by our proposed method for computing topic transitions. By considering the news articles in each week as independent corpora, we are able to detect short-lived topics that would otherwise be lost in a global topic modeling.
The fact that topics are detectable in both representations is an indication that topics are not easily confounded with each other when one considers more abstract categories. This is an interesting result, for it shows that the concepts comprising a topic are distributed widely enough across the thesaurus that abstracting them just one level still allows for their detection. In a sense, the way the concepts have been organized in the thesaurus does not match the real-world occurring topics. We believe that this result can serve as a measure of the generality of a thesaurus, and of its applicability to analyzing texts from various sources.
Table 3. Evolution of the concepts in the Football World Cup topic. The start date of each corpus is shown in the header. All concepts belonging to the topic are shown, sorted in decreasing importance.
2010-06-07 | 2010-06-14 | 2010-06-21 | 2010-06-28 | 2010-07-05 | 2010-07-12
World | World | World | World | World | South Africa
Sport event | Sport event | Sport event | Sport event | South Africa | African
Football | Matching | Football | Football | Sport event | World
Matching | Football | South Africa | Coaching | Football | South African
South Africa | South Africa | Coaching | South Africa | African | South Africans
Slovenia | Brazil | Mexico | Matching | South African | Platinum
Italy | Coaching | France | Brazil | South Africans | Football
Algeria | Argentina | Brazil | Argentina | Humans | Sport event
Australia | Mexico | Argentina | Ghana | Uganda | British
Paraguay | France | African | Uruguay | Black people | Iron
Netherlands | French | Uruguay | Netherlands | Somalia | Iron ore
Ghana | Slovenia | Ghana | Spain | Matching | International
Nation | Cote d'Ivoire | Netherlands | Punishment | International | Golf
Cameroon | Algeria | French | Paraguay | American | AIDS
Coaching | Portugal | Italy | African |  | Black people
African | African | Loss | Loss |  |
Brazil | Uruguay | American | Nigeria |  |
London | Serbia | Slovenia | Gambling |  |
Serbia | Italy | Slovakia | Federation |  |
Industrial action | New Zealand | Chile | Nation |  |
South Korea | Nigeria |  | Industrial action |  |
Punishment | South African |  | American |  |
North Korea | South Africans |  | International |  |
Economy | Slovakia |  |  |  |
Greece | Punishment |  |  |  |
Spain | Greece |  |  |  |
South African | American |  |  |  |
South Africans | Chile |  |  |  |
Police | Netherlands |  |  |  |
Islamic |  |  |  |  |
Dutch |  |  |  |  |
New Zealand |  |  |  |  |
Slovakia |  |  |  |  |
Argentina |  |  |  |  |
Mexico |  |  |  |  |
Portugal |  |  |  |  |
[Figure 3: two histograms, Level 0 Representation and Level 1 Representation, of stable topic duration.]
Fig. 3. Distribution of the lengths of stable topics. The level 1 representation shows both more stable topics and longer-lasting ones. Considering the broader of a leaf concept thus reduces noise in the description of documents.
This work represents initial results in the analysis of topic transitions with the help of a thesaurus. In future work we intend to build on top of the current results and extend the methodology. In particular, the current results motivate us to:
1. Investigate and compare the representations of topics at different levels. Preliminary observations suggest that several stable topics at level 0, in non-overlapping weeks, could be merged into a single topic at level 1.
2. Exploit the greater stability of topics at level 1. This could be useful in the case of limited data, when a detailed topic could fade away.
3. Offer the different levels of representation to readers with different backgrounds. Namely, we expect that more detailed topics could be of interest for experts in the field, whereas the more general topics could give a good overview for less experienced readers.
4. Improve the transition computation by taking the distances between concepts into account and employing methods similar to soft cosine similarity.
5. Perform statistical analysis to confirm our observations that the concepts gained during topic evolution are more likely to be close, in the thesaurus, to concepts already in a topic.
Acknowledgements The work is partially supported by the PROFIT project (http://projectprofit.eu/), part of the European Commission's H2020 Framework Programme, grant agreement no. 687895.
References
1. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 113–120. ACM (2006)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
3. Borst, T., Neubert, J.: Case study: Publishing STW Thesaurus for Economics as linked open data. W3C Semantic Web Use Cases and Case Studies (2009)
4. Ding, X., Zhang, Y., Liu, T., Duan, J.: Using structured events to predict stock price movement: An empirical investigation. In: EMNLP. pp. 1415–1425 (2014)
5. Djurdjevac, N., Sarich, M., Schütte, C.: Estimating the eigenvalue error of Markov state models. Multiscale Modeling & Simulation 10(1), 61–81 (2012)
6. Ferrugento, A., Alves, A., Oliveira, H.G., Rodrigues, F.: Towards the improvement of a topic model with semantic knowledge. In: Portuguese Conference on Artificial Intelligence. pp. 759–770. Springer (2015)
7. Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 363–371. Association for Computational Linguistics (2008)
8. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42(1), 177–196 (2001)
9. Landauer, T.K., Laham, D., Foltz, P.W.: Learning human-like knowledge by singular value decomposition: A progress report. In: Advances in Neural Information Processing Systems. pp. 45–51 (1998)
10. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788 (1999)
11. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics 3, 299–313 (2015)
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
13. Recht, B., Re, C., Tropp, J., Bittorf, V.: Factoring nonnegative matrices with linear programs. In: Advances in Neural Information Processing Systems. pp. 1214–1222 (2012)
14. Saha, A., Sindhwani, V.: Learning evolving and emerging topics in social media: A dynamic NMF approach with temporal regularization. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. pp. 693–702. WSDM '12, ACM, New York, NY, USA (2012)
15. Vaca, C.K., Mantrach, A., Jaimes, A., Saerens, M.: A time-based collective factorization for topic discovery and monitoring in news. In: Proceedings of the 23rd International Conference on World Wide Web. pp. 527–538. WWW '14, ACM, New York, NY, USA (2014)
16. Wang, X., McCallum, A., Wei, X.: Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In: Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007). pp. 697–702. IEEE (2007)