=Paper=
{{Paper
|id=Vol-1960/paper5
|storemode=property
|title=Online Topic Modeling: Keeping Track of News Topics for Social Good
|pdfUrl=https://ceur-ws.org/Vol-1960/paper5.pdf
|volume=Vol-1960
|authors=Zahra Ahmadi,Sophie Burkhardt,Stefan Kramer
}}
==Online Topic Modeling: Keeping Track of News Topics for Social Good==
Zahra Ahmadi, Sophie Burkhardt, and Stefan Kramer
Institut für Informatik, Johannes Gutenberg-Universität,
Staudingerweg 9, 55128, Mainz, Germany
zaahmadi@uni-mainz.de, burkhardt,kramer@informatik.uni-mainz.de
Abstract. The refugee crisis has become an important, albeit controversial,
issue for European countries. There have been many debates on social media
for and against accepting refugees; however, there is little work on
interpreting the data in this regard. In this paper, we propose an online
topic modeling approach that evolves over time and finds the most important
topics in each time slot. Our study shows that outside events have a visible
impact on the media and that this perception can change and evolve over time.
1 Introduction
Europe has witnessed a large movement of migrants and refugees from Africa and
the Middle East in recent years. The arrival wave started in August 2015, and
since then it has been in the media spotlight, with a growing number of
reported events and heated, polarized debates about this phenomenon.
The implications of this crisis are complex and wide-ranging; however, data
mining experts have only recently considered the interpretation of the media:
Coletto et al. [2] proposed an adaptive framework to analyze the spatial,
temporal and sentiment aspects of a polarized topic discussed in online social
media; the GDELT data¹ was used to answer the question of whether the Arab
Spring sparked a wave of global protests²; and Data For Democracy built a tool
capable of tracking and analyzing refugees and other people forced to evacuate
their homes³.
In this work, our goal is to analyze media from Germany, one of the highly
affected European countries, to address the following questions: “What are the
main concerns of each party or news source? How does the perception evolve
over time? How is the perception influenced by events? How similar are different
parties and sources in this respect?”. To this end, we propose an online topic
modeling method to keep track of the topics appearing over time and evaluate
the results on a relatively large dataset from German media.
¹ http://gdeltproject.org/data.html#rawdatafiles
² https://foreignpolicy.com/2014/05/30/did-the-arab-spring-really-spark-a-wave-of-global-protests/
³ https://www.un.org/press/en/2017/pi2207.doc.htm
2 Online Topic Modeling
The generative process for Latent Dirichlet Allocation (LDA) is given as follows:
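$$\phi \sim \mathrm{Dir}(\beta), \quad \theta \sim \mathrm{Dir}(\alpha), \quad z \sim \mathrm{Mult}(\theta), \quad w \sim \mathrm{Mult}(\phi_z) \qquad (1)$$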
For each topic $k$, a multinomial distribution $\phi_k$ over words is drawn from
a Dirichlet distribution with parameter $\beta$. For each document $d$, a
distribution over topics $\theta$ is drawn from a Dirichlet with parameter
$\alpha$. For each word $w_{di}$ in document $d$, a topic indicator $z_{di}$ is
drawn from the multinomial distribution $\theta$. Finally, the word $w_{di}$ is
drawn from the multinomial distribution $\phi_{z_{di}}$ associated with the
chosen topic.
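The generative process above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `generate_corpus`, the symmetric scalar hyperparameters, the fixed document length and all sizes are hypothetical choices, not values from the paper.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process of Eq. (1).

    Assumes symmetric scalar priors and a fixed document length.
    """
    rng = np.random.default_rng(seed)
    # phi_k ~ Dir(beta): one distribution over words per topic
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, assignments = [], []
    for _ in range(n_docs):
        # theta ~ Dir(alpha): topic proportions of this document
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # z_di ~ Mult(theta): one topic indicator per word position
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # w_di ~ Mult(phi_{z_di}): each word is drawn from its topic's distribution
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(w)
        assignments.append(z)
    return phi, docs, assignments

phi, docs, assignments = generate_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=10, alpha=0.5, beta=0.5
)
```

Each entry of `docs` is an array of word indices, and the matching entry of `assignments` holds the topic indicator that generated each word.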
To track the topics online, we separate the data into different time slices
$D = \{D_1, \dots, D_{t-1}, D_t\}$. Following the method proposed by AlSumait
et al. [1], for each time slot $t$ our method learns a topic model by Gibbs
sampling [3] where the parameters $\beta$ are a weighted mixture of the matrices
$\phi^1, \dots, \phi^{t-1}$ from the previous time slots:
$$\beta_k^t = \sum_{t'=1}^{t-1} \omega^{t'} \phi_k^{t'}, \qquad (2)$$
where $\omega^{t'}$ is the weight associated with time slot $t'$.
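As a rough illustration of Eq. (2), the prior for the current slot can be computed as a weighted sum over the stored topic-word matrices. The helper name `mixture_prior` and the toy matrices are assumptions made for this sketch; it is not the implementation of AlSumait et al. [1].

```python
import numpy as np

def mixture_prior(past_phis, weights):
    """Eq. (2): beta^t = sum over t' of omega^{t'} * phi^{t'}.

    past_phis: list of (K, V) topic-word matrices phi^1 .. phi^{t-1}
    weights:   one weight omega^{t'} per stored matrix
    """
    past = np.stack(past_phis)                  # shape (t-1, K, V)
    omega = np.asarray(weights, dtype=float)    # shape (t-1,)
    if omega.shape[0] != past.shape[0]:
        raise ValueError("one weight per stored time slot is required")
    # Contract the time axis: result has shape (K, V)
    return np.tensordot(omega, past, axes=1)

# Toy example with one topic and a two-word vocabulary
phi_1 = np.array([[0.9, 0.1]])
phi_2 = np.array([[0.2, 0.8]])
beta_equal = mixture_prior([phi_1, phi_2], [0.5, 0.5])
beta_last_only = mixture_prior([phi_1, phi_2], [0.0, 1.0])
```

Note that every past matrix must be kept in `past_phis`, so memory grows linearly with the number of time slots. With weights `[0.0, 1.0]` only the most recent slot contributes to the prior.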
In practice, this means that one has to keep the matrices $\phi^{t'}$ of all
previous time slots in memory to compute the weighted sum for the current time
slot. This is inefficient in terms of memory and runtime and not in the spirit
of a true online method. In their experiments, AlSumait et al. [1] therefore
only use the previous time slot, meaning $\omega^{t'}$ is zero for all other
time slots. This makes the method more practical; however, it introduces a
problem: consider the case where a certain topic occurs in one time slot, is
absent in the next time slot, and reoccurs in the one after that. In this case,
the model will have forgotten everything from the earlier occurrence of the
topic, since it only takes the previous time slot into account. This makes the
results highly dependent on the size of the data slices and the content of the
data.
Our solution is based on the definition of variational Bayes online topic mod-
els [4]. In online variational Bayes, instead of taking samples, a natural gradient
is calculated. After each batch, the model is then updated as
$$\phi^t = (1 - \rho)\,\phi^{t-1} + \rho\,\hat{\phi}^t, \qquad (3)$$
where $\rho$ is a real-valued update parameter in the interval $[0, 1]$ and
$\hat{\phi}$ is the estimate for $\phi$ based on the current batch. In our
model, we adopt this strategy and use it only for updating the parameter
$\beta$:
$$\beta^t = (1 - \rho)\,\beta^{t-1} + \rho\,\phi^{t-1}. \qquad (4)$$
This means that we let the prior parameter $\beta$ converge to a stationary
point, whereas $\phi^t$ is specific to a certain time slot $t$. Thus, we can
analyze the data in a certain time slot while inducing the model to keep the
topics stable over time without having to save any of the previous matrices
$\phi$. In contrast to the online
method by Hoffman et al. [4], our method learns topics that are only based on
the data from the current time slot, making it easier to track changes or detect
specific events.
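The update of the prior in Eq. (4) can be sketched as follows. Unrolling the recursion gives $\beta^{t} = (1-\rho)^{t-1}\beta^{1} + \rho \sum_{s=1}^{t-1} (1-\rho)^{t-1-s}\phi^{s}$, so every earlier slot still contributes with an exponentially decaying weight even though only one matrix is stored. The function name `update_prior`, the initial value of the prior, the value of `rho` and the random stand-ins for the per-slot matrices are illustrative assumptions; in the actual method each $\phi^t$ would come from Gibbs sampling on slot $t$.

```python
import numpy as np

def update_prior(beta_prev, phi_prev, rho):
    """Eq. (4): beta^t = (1 - rho) * beta^{t-1} + rho * phi^{t-1}."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return (1.0 - rho) * beta_prev + rho * phi_prev

rng = np.random.default_rng(0)
K, V, rho = 2, 5, 0.3                      # hypothetical sizes and update parameter
beta_init = np.full((K, V), 0.01)          # assumed starting prior beta^1
# Random stand-ins for the sampled matrices phi^1 .. phi^4
phis = [rng.dirichlet(np.ones(V), size=K) for _ in range(4)]

beta = beta_init
for phi in phis:
    # Only the current beta is carried over between time slots
    beta = update_prior(beta, phi, rho)

# Closed form obtained by unrolling the recursion, for comparison
T = len(phis)
unrolled = (1 - rho) ** T * beta_init + rho * sum(
    (1 - rho) ** (T - s) * phis[s - 1] for s in range(1, T + 1)
)
```

Comparing `beta` with `unrolled` confirms that the single running matrix equals the exponentially weighted sum over all past slots, which is how a topic that skips a slot can still leave a trace in the prior.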
3 Experiments
We extracted a set of news articles containing a keyword related to refugees,
published between January 2016 and May 2017 in German media. We preprocessed
the data by removing numbers, stop words and one-letter words and by converting
all letters to lower case. Instances that became empty after preprocessing were
removed; hence, the dataset was reduced to 208,683 articles with 71,633
features. To run offline LDA and the online topic modeling method, we set the
number of topics to 100. The time slots contain 10,000 instances each, and the
method is repeated 100 times for each slot. The update parameter $\rho$ is set
to 10010.9, according to the instructions provided by Hoffman et al. [4].
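The preprocessing and slicing steps described above might look roughly like the following sketch. The stop-word list, the regular-expression tokenizer and the function names are hypothetical; the paper does not specify which tokenizer or stop-word list was used, and the documents are assumed to be ordered chronologically before slicing.

```python
import re

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"der", "die", "das", "und", "in", "zu"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, keep letter-only tokens, drop stop words and one-letter words."""
    # [^\W\d_]+ matches runs of Unicode letters, so digits are discarded
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) > 1 and t not in stop_words]

def build_corpus(raw_texts):
    """Preprocess all articles and drop those that become empty."""
    processed = (preprocess(t) for t in raw_texts)
    return [doc for doc in processed if doc]

def time_slices(docs, slice_size=10_000):
    """Split chronologically ordered documents into consecutive time slots."""
    for start in range(0, len(docs), slice_size):
        yield docs[start:start + slice_size]

corpus = build_corpus(["Die AfD und 3 Flüchtlinge in Köln a", "123 und 4"])
slice_lengths = [len(s) for s in time_slices(list(range(25)), slice_size=10)]
```

In this toy run the second article consists only of numbers and a stop word, so it is removed from `corpus`, and the 25 placeholder documents are split into slots of 10, 10 and 5.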
Figure 1 illustrates the results of offline LDA, shown as a word cloud, and the
topic evolution of the online topic modeling method for one of the topics, which
is mainly about the AFD party (a right-wing populist political party in Germany)
and its news related to refugees. Each box presents the English translation of
the 20 most frequent words of the topic in one time slot. The dates show the
start of each time slot, and the figure shows every third slot. We can see
that although the topic changes over time, it consistently remains about AFD
party news. The advantage of this model over previous online LDA models
(e.g., [4]) is that it puts more emphasis on the current batch, rather than
incrementally updating the previous topic model with only a marginal effect of
the new batch.
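Extracting the top words of a topic for one time slot, as shown in the boxes of Figure 1, amounts to sorting one row of the topic-word matrix. The sketch below assumes a hypothetical helper `top_words` and a toy vocabulary; it is not taken from the authors' code.

```python
import numpy as np

def top_words(phi_k, vocab, n=20):
    """Return the n highest-probability words of one topic's word distribution."""
    # argsort is ascending, so reverse it and keep the first n indices
    idx = np.argsort(phi_k)[::-1][:n]
    return [vocab[i] for i in idx]

# Toy topic over a three-word vocabulary
vocab = ["partei", "afd", "wahl"]
phi_k = np.array([0.1, 0.5, 0.4])
best_two = top_words(phi_k, vocab, n=2)
```

Applying this to the row of the same topic index in every time slot would yield one word list per slot, which is the kind of sequence the figure displays.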
Looking into the most frequent words of each temporal topic, we can observe
that in each period some subjects are highlighted in anticipation of upcoming
events. For example, in the Landtag election of the state of North
Rhine-Westphalia, which was held on 14 May 2017, Helmut Seifen (an AFD
politician) was elected, and he already appears among the top words of the news
related to refugee politics in the slot starting on 2017-03-31. Similarly,
because of the importance of the Bundestag election for a young party like the
AFD, we can observe many discussions related to it from the beginning of 2017,
although the election was not held until September 2017. This topic is one
of the 100 topics produced by our method. We observe other topics about
other parties (e.g., SPD and CDU), some topics related to refugee integration,
and some about job markets or even about women and children. However, not all
of the topics are that well-defined.
[Figure 1: a word cloud (batch LDA) containing the words petry, partei,
alternative, eu, deutschland, spitzenkandidatin, januar, steht, jörg, landtag,
afd, wähler, landtagswahl, parteien, frauke, euro, klar, lucke, flüchtlinge,
bundeskanzlerin, türkei, bundestagswahl, angela, migration, politiker, cdu,
grünen, alexander, merkel, gauland; and text boxes with the translated top
words of the online topic for the time slots starting on 2016-01-18,
2016-07-05, 2017-01-11, 2017-02-06, 2017-03-08, 2017-03-31, 2017-04-16 and
2017-05-01.]
Fig. 1: An example of topic evolution for a topic related to the AFD party. The
word cloud illustrates the output of a batch LDA on the data, while the text
boxes show the output of our proposed method.
4 Conclusion, Challenges, and Future Work
We proposed an online topic modeling method to find the topics related to
refugees in German media. During our experiments, we faced several challenges.
Our first goal was to find a categorization of the reasons for being against or
in favor of accepting the refugees among different opinions. Although the model
finds interesting topics in the data, this goal remains unsolved. Another
unsuccessful attempt was to find an unsupervised method to cluster different
sources based on the opinions expressed in their articles, with the hope of
finding their political view. As future work, one could develop a
semi-supervised approach to build a topic model which can reach these goals.
References
1. AlSumait, L., Barbará, D., Domeniconi, C.: On-line LDA: Adaptive topic models for
mining text streams with applications to topic detection and tracking. In: Eighth
IEEE International Conference on Data Mining. pp. 3–12 (2008)
2. Coletto, M., Esuli, A., Lucchese, C., Muntean, C.I., Nardini, F.M., Perego, R.,
Renso, C.: Sentiment-enhanced multidimensional analysis of online social networks:
Perception of the Mediterranean refugees crisis. In: IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining. pp. 1270–1277 (2016)
3. Griffiths, T.L., Steyvers, M.: Finding scientific topics. In: Proceedings of the
National Academy of Sciences of the United States of America. vol. 101,
pp. 5228–5235. National Academy of Sciences (2004)
4. Hoffman, M., Bach, F.R., Blei, D.M.: Online learning for latent Dirichlet allocation.
In: Advances in Neural Information Processing Systems. pp. 856–864 (2010)