<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Online Topic Modeling: Keeping Track of News Topics for Social Good</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zahra Ahmadi</string-name>
          <email>zaahmadi@uni-mainz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sophie Burkhardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Kramer</string-name>
          <email>kramer@informatik.uni-mainz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut fur Informatik, Johannes Gutenberg-Universitat</institution>
          ,
          <addr-line>Staudingerweg 9, 55128, Mainz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The refugee crisis has become an important, albeit controversial, issue for European countries. There have been many debates on social media for and against accepting refugees; however, there is little work on the interpretation of data in this regard. In this paper, we propose an online topic modeling approach that evolves over time and finds the most important topics in each time slot. Our study shows that outside events have a visible impact on the media and that this perception can change or evolve over time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Online Topic Modeling</title>
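      <p>The generative process for Latent Dirichlet Allocation (LDA) is given as follows:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\phi \sim \mathrm{Dir}(\beta); \quad \theta \sim \mathrm{Dir}(\alpha); \quad z \sim \mathrm{Mult}(\theta); \quad w \sim \mathrm{Mult}(\phi_z)</tex-math>
        </disp-formula>
      </p>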
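      <p>For each topic k, a multinomial distribution <inline-formula><tex-math>\phi_k</tex-math></inline-formula> over words is drawn from
a Dirichlet distribution with parameter <inline-formula><tex-math>\beta</tex-math></inline-formula>. For each document d, a distribution <inline-formula><tex-math>\theta</tex-math></inline-formula>
over topics is drawn from a Dirichlet with parameter <inline-formula><tex-math>\alpha</tex-math></inline-formula>. For each word <inline-formula><tex-math>w_{di}</tex-math></inline-formula> in
document d, a topic indicator <inline-formula><tex-math>z_{di}</tex-math></inline-formula> is drawn from the multinomial distribution <inline-formula><tex-math>\theta</tex-math></inline-formula>.
Finally, the word w is drawn from the multinomial distribution <inline-formula><tex-math>\phi_{z_{di}}</tex-math></inline-formula> associated
with the chosen topic.</p>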
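      <p>The generative process can be illustrated with a minimal sketch that samples a toy corpus. It uses only the Python standard library; the function names, the symmetric concentration values, and the corpus sizes are assumptions chosen for illustration and are not taken from our experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative)."""
    rng = random.Random(seed)
    # phi_k ~ Dir(beta): one distribution over words per topic.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # theta ~ Dir(alpha): one distribution over topics per document.
        theta = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(doc_len):
            # z_di ~ Mult(theta): topic indicator for this word position.
            z = rng.choices(range(n_topics), weights=theta)[0]
            # w_di ~ Mult(phi_z): word id drawn from the chosen topic.
            words.append(rng.choices(range(vocab_size), weights=phi[z])[0])
        docs.append(words)
    return phi, docs


phi, docs = generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=50)
```
]]></preformat>
      <p>Each document is returned as a list of word ids; inference (for example by Gibbs sampling) works in the opposite direction and recovers the topic-word distributions from such documents.</p>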
      <p>To track the topics online, we separate the data into different time slices
<inline-formula><tex-math>D = \{D^1, \ldots, D^{t-1}, D^t\}</tex-math></inline-formula>. Following the method proposed by AlSumait et al. [1],
for each time slot t our method learns a topic model by Gibbs sampling [3] where
the parameters <inline-formula><tex-math>\beta^t</tex-math></inline-formula> are a weighted mixture of the matrices <inline-formula><tex-math>\phi^1, \ldots, \phi^{t-1}</tex-math></inline-formula> from the
previous time slots:
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>\beta_k^t = \sum_{t'=1}^{t-1} \omega^{t'} \phi_k^{t'},</tex-math>
        </disp-formula>
where <inline-formula><tex-math>\omega^t</tex-math></inline-formula> is the weight associated with time slot t.
      </p>
      <p>In practice, this means that one has to keep all matrices <inline-formula><tex-math>\phi^t</tex-math></inline-formula> associated with
all time slots in memory to compute the weighted sum for the current time slot.
This is inefficient in terms of memory and runtime and not in the spirit of a true
online method. In their experiments, AlSumait et al. [1] therefore only use the
previous time slot, meaning <inline-formula><tex-math>\omega^t</tex-math></inline-formula> is zero for all other time slots. This makes the
method more practically relevant; however, it introduces a problem: Consider
the case where a certain topic occurs in one time slot, is absent in the next time
slot, and reoccurs in the one after that. In this case, the model will forget everything from
the previous occurrence of the topic since it only takes the previous time slot
into account. This makes the results highly dependent on the size of the data
slices and the content of the data.</p>
      <p>Our solution is based on the definition of variational Bayes online topic
models [4]. In online variational Bayes, instead of taking samples, a natural gradient
is calculated. After each batch, the model is then updated as
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>\phi^t = (1 - \rho)\,\phi^{t-1} + \rho\,\hat{\phi}^t,</tex-math>
        </disp-formula>
where <inline-formula><tex-math>\rho</tex-math></inline-formula> is a real-valued update parameter in the [0, 1] interval and <inline-formula><tex-math>\hat{\phi}</tex-math></inline-formula> is the
estimate for <inline-formula><tex-math>\phi</tex-math></inline-formula> based on the current batch. In our model, we adopt this strategy
and only use it for updating the parameter <inline-formula><tex-math>\beta</tex-math></inline-formula>:
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>\beta^t = (1 - \rho)\,\beta^{t-1} + \rho\,\phi^{t-1}.</tex-math>
        </disp-formula>
      </p>
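      <p>A minimal sketch of this prior update follows, assuming the prior for a time slot is the blend (1 - rho) * beta_prev + rho * phi_prev of the previous prior and the previous slot's topic-word matrix. The function name update_prior, the single-topic toy matrices, the symmetric initial prior, and rho = 0.5 are hypothetical choices for illustration; the topic-word matrices would in practice come from Gibbs sampling on each slot.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def update_prior(beta_prev, phi_prev, rho):
    """Blend the previous prior with the previous slot's topic-word matrix."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [
        [(1.0 - rho) * b + rho * p for b, p in zip(beta_row, phi_row)]
        for beta_row, phi_row in zip(beta_prev, phi_prev)
    ]


# Toy setting: one topic over a three-word vocabulary.
phi_slots = [
    [[0.8, 0.1, 0.1]],  # slot 1: the topic is dominated by word 0
    [[0.1, 0.1, 0.8]],  # slot 2: word 0 is almost absent
]
beta = [[1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]]  # assumed symmetric initial prior
rho = 0.5                                    # illustrative value only

for phi in phi_slots:
    beta = update_prior(beta, phi, rho)

# A prior that only looks one slot back is simply the last matrix.
previous_slot_only = phi_slots[-1]
```
]]></preformat>
      <p>In this toy setting the blended prior for the third slot keeps about one third of its mass on word 0, whereas a prior copied from the second slot alone keeps only 0.1. This is the sense in which a running prior retains a topic that is temporarily absent, while only one matrix is held in memory.</p>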
      <p>This means that we let the prior parameter <inline-formula><tex-math>\beta</tex-math></inline-formula> converge to a stationary point,
whereas <inline-formula><tex-math>\phi^t</tex-math></inline-formula> is specific to a certain time slot t. Thus, we can analyze the data in
a certain time slot while inducing the model to keep the topics stable over time
without having to save any of the previous matrices <inline-formula><tex-math>\phi</tex-math></inline-formula>. In contrast to the online
method by Hoffman et al. [4], our method learns topics that are only based on
the data from the current time slot, making it easier to track changes or detect
specific events.</p>
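      <p>To see why no earlier matrices need to be stored, one can unroll the prior update. The following is a worked sketch under two assumptions that are made here only for illustration: the update has the form <inline-formula><tex-math>\beta^t = (1-\rho)\beta^{t-1} + \rho\phi^{t-1}</tex-math></inline-formula> with a constant <inline-formula><tex-math>\rho</tex-math></inline-formula>, and the recursion starts from some initial prior <inline-formula><tex-math>\beta^1</tex-math></inline-formula>.</p>
      <p>
        <disp-formula>
          <tex-math>\beta^t = (1-\rho)^{t-1}\,\beta^1 + \rho \sum_{t'=1}^{t-1} (1-\rho)^{\,t-1-t'}\,\phi^{t'}</tex-math>
        </disp-formula>
      </p>
      <p>Under these assumptions the running prior behaves like a weighted mixture over all earlier time slots with geometrically decaying weights <inline-formula><tex-math>\rho(1-\rho)^{t-1-t'}</tex-math></inline-formula>, even though only the most recent prior and topic-word matrix are kept.</p>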
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We extracted a set of news articles containing a refugee-related keyword,
published between January 2016 and May 2017 in German media. We preprocessed the data by
removing numbers, stop words, and one-letter words and by converting all letters
to lower case. Instances that become empty after preprocessing are removed;
hence, the dataset is reduced to 208,683 articles with 71,633 features. To run
offline LDA and the online topic modeling method, we set the number of topics
to 100. The time slots contain 10,000 instances, and the method is repeated
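100 times for each slot. The update parameter <inline-formula><tex-math>\rho</tex-math></inline-formula> is set to <inline-formula><tex-math>1/100^{0.9}</tex-math></inline-formula>, according to
the instructions provided by Hoffman et al. [4].</p>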
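      <p>The preprocessing and slicing steps can be sketched as follows. This is a simplified illustration and not the pipeline used for the experiments: the stop-word list is a tiny hypothetical placeholder, tokens are taken to be runs of letters, and the function names preprocess and time_slots are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Tiny illustrative stop-word list; a real German list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, then drop numbers, stop words and one-letter words."""
    # Runs of Unicode letters only, so digits and punctuation are discarded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) > 1 and t not in stop_words]


def time_slots(docs, slot_size):
    """Yield consecutive slices of at most slot_size non-empty documents."""
    kept = [d for d in docs if d]  # remove instances emptied by preprocessing
    for start in range(0, len(kept), slot_size):
        yield kept[start:start + slot_size]


raw_articles = ["Die AfD in Berlin", "42", "Partei und Wahl", "Merkel"]
docs = [preprocess(text) for text in raw_articles]
slots = list(time_slots(docs, slot_size=2))
```
]]></preformat>
      <p>With a slot size of 10,000, the same slicing would turn a chronologically ordered corpus into the sequence of time slots that the online method processes one at a time.</p>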
      <p>Figure 1 shows, for one topic that is mainly about the AFD party
(a right-wing populist political party in Germany) and its refugee-related news,
the result of offline LDA as a word cloud together with the topic evolution
found by the online topic modeling method. Each box presents the English translation of
the 20 most frequent words of the current topic. The dates mark the start
of each time slot, and the figure shows every third slot. We can see
that although the topic changes over time, it consistently concerns AFD
party news. The advantage of this model over previous online LDA models
(e.g., [4]) is that it puts more emphasis on the current batch, rather than updating the
previous topic model incrementally with only a marginal effect from the new batch.</p>
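      <p>Reading off the most probable words of a topic in a given time slot can be sketched as below. The function name top_words and the three-word toy vocabulary are hypothetical; the input is assumed to be a topic-word matrix with one row per topic, as estimated for the current slot.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def top_words(phi, vocab, n=20):
    """Return the n highest-weighted words for every topic (row) of phi."""
    result = []
    for row in phi:
        # Word indices sorted by decreasing weight within this topic.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


vocab = ["afd", "partei", "wahl"]
phi_slot = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
]
boxes = top_words(phi_slot, vocab, n=2)
```
]]></preformat>
      <p>Applying the same function to the matrix of every third time slot gives one word list per box for a chosen topic.</p>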
      <p>Looking into the most frequent words of each temporal topic, we can observe
that in each period, based on the upcoming events, some topics are highlighted.
For example, Helmut Seifen (an AFD politician) was elected in the Landtag election
of the state of North Rhine-Westphalia, which was held on 14 May 2017, and we can
already see him at the top of the news related to refugee politics on 2017-03-31.
Likewise, because of the importance of the Bundestag election for a young party like
the AFD, we can observe many related discussions since the beginning of
2017, although the election is held only in September 2017. This topic is one
of the 100 topics produced by our method. We observe other topics about
other parties (e.g., SPD and CDU), some topics related to refugee integration,
and some about job markets or even about women and children. However, not all of
the topics are that well-defined.</p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>The word cloud illustrates the output of a batch LDA on the data while the text boxes show the output
of our proposed method. The text boxes are labeled with the dates 2016-01-18, 2017-02-06, 2017-03-08, 2017-03-31, 2017-04-16, and 2017-05-01.</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-4">
      <title>Conclusion, Challenges, and Future Work</title>
      <p>We proposed an online topic modeling method to find the topics related to
refugees in German media. During our experiments, we faced several challenges.
Our first goal was to find a categorization of the reasons for being against or
in favor of accepting the refugees among different opinions. Although the model
finds interesting topics in the data, this goal remains unsolved. Another
unsuccessful attempt was to find an unsupervised method to cluster different sources
based on the opinions expressed in their articles, with the hope of finding their
political view. As future work, one could develop a semi-supervised approach
to build a topic model that can reach these goals.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. <string-name><surname>AlSumait</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Barbará</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Domeniconi</surname>, <given-names>C.</given-names></string-name>: <article-title>On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking</article-title>. <source>In: Eighth IEEE International Conference on Data Mining</source>. pp. 3–12 (2008)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. <string-name><surname>Coletto</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Esuli</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lucchese</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Muntean</surname>, <given-names>C.I.</given-names></string-name>, <string-name><surname>Nardini</surname>, <given-names>F.M.</given-names></string-name>, <string-name><surname>Perego</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Renso</surname>, <given-names>C.</given-names></string-name>: <article-title>Sentiment-enhanced multidimensional analysis of online social networks: Perception of the mediterranean refugees crisis</article-title>. <source>In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</source>. pp. 1270–1277 (2016)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. <string-name><surname>Griffiths</surname>, <given-names>T.L.</given-names></string-name>, <string-name><surname>Steyvers</surname>, <given-names>M.</given-names></string-name>: <article-title>Finding scientific topics</article-title>. <source>In: Proceedings of the National Academy of Sciences of the United States of America</source>. vol. 101, pp. 5228–5235. National Academy of Sciences (2004)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. <string-name><surname>Hoffman</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bach</surname>, <given-names>F.R.</given-names></string-name>, <string-name><surname>Blei</surname>, <given-names>D.M.</given-names></string-name>: <article-title>Online learning for latent dirichlet allocation</article-title>. <source>In: Advances in neural information processing systems</source>. pp. 856–864 (2010)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>