<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Evolution of the Migration-related Topics on Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yiyi Chen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Genet Asefa Gesese</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehwish Alam</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This poster focuses on capturing the temporal evolution of migration-related topics in relevant tweets. It uses the Dynamic Embedded Topic Model (DETM) as a learning algorithm to perform a quantitative and qualitative analysis of these emerging topics. TweetsKB is extended with the extracted Twitter dataset along with the results of DETM, which takes temporality into account. These results are then further analyzed and visualized. The analysis reveals that the trajectories of the migration-related topics are in agreement with historical events. The source code is available online: https://bit.ly/3dN9ICB.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Social media has become one of the most widely used channels for people to
exchange opinions about social events around the globe. It provides one of the
most useful resources about social interactions and trending events on important
topics such as migration, climate change, and political elections. Over the last
decade, migration has become one of the most controversial topics in Europe.
This poster analyzes Twitter data from the destination countries in Europe
to gain insight into migration-related events. The analysis of the temporal
trajectory of the topics in migration-related tweets shows how people's attention
shifts over time.</p>
      <p>For example, people's interest in "Syrian refugees" peaked in 2016 in Europe
and decayed afterwards. With the emergence of COVID-19 in 2020, the attention
shifted to "border control". By analyzing the co-occurring topic words, the
driving factors can be identified. As shown in Fig. 2, from 2020 onwards the
probability trajectories of the co-occurring words "COVID", "migrant", and "border" have
similar trends, which indicates the pandemic as a causal factor of the debates about
border control, further affecting the migration flows in this period.</p>
      <p>Several efforts have been made to provide a Knowledge Base (KB) about
tweets to make the data more accessible and usable for further research. One
such effort is TweetsKB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is a KB containing 1.5 billion tweets
spanning 5 years, including entity and sentiment annotations.
MigrAnalytics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is another such effort, which uses TweetsKB as a starting point to
analyze migration-related tweets using entity-based approaches. While both provide
time-series data, no attempt has been made to analyze the temporal evolution
of migration-related topics, which would help to identify how the driving
factors of migration change over time. This study complements
MigrationsKB (MGKB) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with tweets annotated with topics that evolve over time,
obtained using advanced methods based on neural networks.
      </p>
      <p>This work is a part of the ITFLOWS project, which has received funding from the EU
H2020 research and innovation program under grant agreement No 882986.
Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Temporal Evolution of Topics in Migrations</title>
      <p>Tweet Extraction. Keyword-based methods are used to extract tweets from
Twitter. The words "immigration" and "refugee" are used as seed words,
which are then enriched with the top-50 most similar words according to pre-trained Word2Vec
and fastText embeddings. The final set of keywords is then manually verified.
The collected tweets are from 11 destination countries, where most refugees in
Europe are hosted. These countries are selected by ranking them according to
the frequency of asylum seekers obtained from Eurostat<sup>3</sup>. They
include the United Kingdom, Germany, Spain, Poland, France, Sweden, Austria,
Hungary, Switzerland, the Netherlands, and Italy. The final number of extracted
tweets, spanning 7 years (2013 - 2020), is 384891. The tweets are then
preprocessed by removing user mentions, reserved words (i.e., RT), emojis, smileys,
URLs, stop words, punctuation, numerical tokens, and HTML tags. Moreover, the
tokens except hashtags are lemmatized. Finally, the tweets with at least two
words are retained.</p>
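      <p>The preprocessing steps listed above can be approximated with a few regular expressions. The following is a minimal sketch, not the pipeline used in this study: the stop-word list is a tiny hypothetical placeholder, the smiley pattern only covers a few common forms, and lemmatization is omitted because it requires an external NLP library. The names preprocess and filter_corpus are introduced here for illustration only.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "and", "is", "are", "for", "on"}

HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumption: only simple western smileys such as :) ;-) :D are handled.
SMILEY_RE = re.compile(r"(?<!\S)[:;=][\-']?[\)\(\]\[dDpP/\\|](?!\S)")
# Words and hashtags; punctuation and emojis never match and are thus dropped.
TOKEN_RE = re.compile(r"#?\w+")


def preprocess(tweet):
    """Return the cleaned, lower-cased tokens of one tweet (no lemmatization)."""
    text = HTML_TAG_RE.sub(" ", tweet)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = SMILEY_RE.sub(" ", text)
    tokens = []
    for token in TOKEN_RE.findall(text):
        low = token.lower()
        if low == "rt":          # reserved word
            continue
        if low in STOP_WORDS:    # stop word
            continue
        if low.isdigit():        # purely numerical token
            continue
        tokens.append(low)
    return tokens


def filter_corpus(tweets):
    """Preprocess all tweets and retain those with at least two tokens."""
    cleaned = (preprocess(t) for t in tweets)
    return [tokens for tokens in cleaned if len(tokens) >= 2]


if __name__ == "__main__":
    sample = ["RT @user: Refugees arrive in Germany :D https://t.co/x #refugeeswelcome 2015"]
    print(filter_corpus(sample))
```
]]></preformat>
      <p>Hashtags are kept as single tokens with their leading "#", which mirrors the statement that hashtags are excluded from lemmatization.</p>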
      <p>
        Dynamic Topic Modeling. Topic modeling is used to extract hidden
semantics in textual documents. The most widely used algorithm for topic modeling is
Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, it fails in the presence of large
vocabularies. The Embedded Topic Model (ETM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] aims to solve this problem by
employing word embeddings. It defines each topic as a vector in the word
embedding space, and then represents the per-topic proportion as joint information
from word and topic embeddings. DETM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] extends ETM by
using a probabilistic time series to model how the topics vary over time. Similar
to ETM, the probability of each word under DETM is a categorical
distribution whose parameters depend on joint information from word embeddings and
per-topic proportion; however, the topic proportions vary over time.
      </p>
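      <p>To make the notion of "joint information from word and topic embeddings" concrete, the following minimal sketch assumes the softmax-of-inner-products form used by ETM-style models: the score of a word under a topic is the inner product of the word embedding and the topic embedding, and a softmax turns the scores into a categorical distribution over the vocabulary. The vocabulary, the 2-dimensional vectors, and the two time slices are made-up toy values, not outputs of the model trained in this study.</p>
      <preformat><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(word_embeddings, topic_embedding):
    """Categorical distribution over the vocabulary for one topic."""
    scores = [
        sum(w * t for w, t in zip(vec, topic_embedding))
        for vec in word_embeddings
    ]
    return softmax(scores)


# Toy, made-up 2-dimensional embeddings (not values from the trained model).
vocab = ["border", "refugee", "covid"]
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

# One hypothetical topic whose embedding drifts between two time slices.
topic_by_time = {2016: [0.0, 2.0], 2020: [2.0, 0.0]}

if __name__ == "__main__":
    for year, topic_embedding in topic_by_time.items():
        beta = topic_word_distribution(word_embeddings, topic_embedding)
        top_word = vocab[beta.index(max(beta))]
        print(year, top_word, [round(p, 3) for p in beta])
```
]]></preformat>
      <p>Because the topic vector differs per time slice, the same vocabulary receives a different distribution in each slice, which is the kind of word probability trajectory analyzed later in this section.</p>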
      <p>
        In DETM, the word and the topic embeddings are trained in parallel on the
extracted Twitter data with a time variable on a predefined number of topics.
The data is split into train (85%), validation (5%), and test (10%) sets (i.e.,
306077, 18011, and 35766 tweets, respectively) with a vocabulary size of 20865.
DETM is optimized on a document completion task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the best model is used
to generate topics from the tweets. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors choose 50 as the number
of topics. For the sake of completeness, the experiments are also conducted with
25 and 75 topics in this study. However, after applying the pretrained DETM
to the extracted Twitter data, many redundant topics are found, i.e., topics that are not
assigned to any tweets. Hence, models with lower numbers of topics are
applied. As shown in Table 1, only the models trained with 5 and 10 topics have all
their topics assigned to tweets; these models are used for further analysis.</p>
      <p><sup>3</sup> https://ec.europa.eu/eurostat</p>
      <p>
        The trained word embeddings from DETM represent the similarity between
the words in the tweets. To select the tweets that are relevant to the topic of
migration, the centroid of the word vectors of the keywords is obtained by clustering with
KMeans, and then the cosine similarity between the centroid and the trained
topic proportions of the tweets is calculated (i.e., the confidence score). The relevant
tweets are those with a confidence score greater than 0. As shown in Fig. 1a, for both
models, the scores are normally distributed, while the topic proportions of the
tweets are trained under Gaussian noise [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. That is, the distance between
the tweets and the centroid has the same distribution as the topic proportions.
Therefore, "migration" is the central topic of the tweets.
      </p>
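      <p>The relevance filter can be illustrated with a small sketch. It rests on two assumptions that are not stated in the text: a single KMeans cluster is used, in which case the centroid equals the arithmetic mean of the keyword vectors, and the per-tweet vectors have the same dimensionality as the centroid. The function names and the toy vectors are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def centroid(vectors):
    """Arithmetic mean of equally sized vectors (the k=1 KMeans centroid)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_relevant(tweet_vectors, keyword_vectors, threshold=0.0):
    """Return (index, confidence score) for tweets scoring above the threshold."""
    center = centroid(keyword_vectors)
    scored = [(i, cosine(vec, center)) for i, vec in enumerate(tweet_vectors)]
    return [(i, score) for i, score in scored if score > threshold]


if __name__ == "__main__":
    # Made-up toy vectors for illustration only.
    keyword_vectors = [[1.0, 0.0], [0.8, 0.2]]
    tweet_vectors = [[1.0, 0.1], [-1.0, 0.0], [0.0, 0.0]]
    print(select_relevant(tweet_vectors, keyword_vectors))
```
]]></preformat>
      <p>With the threshold of 0 described above, a tweet is kept whenever its vector points into the same half-space as the keyword centroid, so the filter is permissive by design.</p>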
      <p>In Fig. 1b, the distribution of the tweets stays almost the same across time
and across topics. However, in Fig. 1c, after filtering the non-relevant tweets,
the tweets of the 5th topic are filtered out, while the tweets of the 8th topic
are more evenly distributed across time. After the automatic selection, since the
model trained with 10 topics provides richer topics across time, this model is
selected for further analysis. Fig. 1d shows the distribution of the non-relevant
tweets per country.</p>
      <p>Fig. 2 shows the evolution of word probability across time for four different
specific topics (i.e., the titles of the plots) learned by DETM. For each topic, a set
of words whose probability shifts align with historical events is presented. For
example, for "Syria, Refugee, and EU", the probabilities of the words shift in
the same manner from 2014 to 2017, reflecting the large number
of Syrian refugees entering the European Union, which peaked from 2015 to
2016. In 2018, the Brexit decisions were made; at the same time, media
outlets claimed a strong connection between Brexit and Trumpism, which is
also reflected in "Brexit and Refugee". The discussion of the border has a strong
correlation with the emergence of COVID-19 since 2020, as shown in "Migrant
and Border".</p>
      <p>Fig. 1: Tweets assigned with topics. (a) Distribution of confidence scores for the models with 5 and 10 topics. (b) Distribution of tweets before and after filtering for the model with 5 topics. (c) Distribution of tweets before and after filtering for the model with 10 topics. (d) Distribution of non-relevant and relevant tweets per country.</p>
      <p>Extending TweetsKB. The tweets along with the assigned topics are stored
in an extension of TweetsKB. Moreover, the words associated with each topic
across time are also added using new classes and properties; please refer to the
GitHub page for more details. This information can be queried using SPARQL.
The following query shows the evolution of the words of a given topic.</p>
      <preformat>SELECT ?TopicWords ?year WHERE {
  ?topic sioc:id "7" ; mgkb:topicOccur ?spTopic .
  ?spTopic dc:date ?year ; schema:description ?TopicWords .
} ORDER BY DESC(?year)</preformat>
    </sec>
    <sec id="sec-3">
      <title>Discussion and Future Work</title>
      <p>This study focuses on capturing the evolution of topics in migration-related
tweets in the destination countries. It uses the time-aware topic modeling method
DETM to achieve this goal. The results are then populated in the extended
TweetsKB, where the temporal dimension is represented with the help of literals
(date values) for further analysis. As future work, an extensive manual verification
of the chosen topics will be conducted, and the current RDF schema will
be extended with RDF-Star.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gesese</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezaie</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sack</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>MigrAnalytics: Entity-based analytics of migration tweets</article-title>
          .
          <source>CEUR Workshop Proceedings 2721</source>
          ,
          <fpage>74</fpage>
          –
          <lpage>78</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          (Mar
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sack</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>MigrationsKB: A knowledge base of public attitudes towards migrations and their driving factors</article-title>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dieng</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz</surname>
            ,
            <given-names>F.J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>The dynamic embedded topic model</article-title>
          . CoRR abs/1907.05545 (
          <year>2019</year>
          ), http://arxiv.org/abs/1907.05545
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dieng</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz</surname>
            ,
            <given-names>F.J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Topic modeling in embedding spaces</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fafalios</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iosifidis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ntoutsi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>TweetsKB: A public and large-scale RDF corpus of annotated tweets</article-title>
          . CoRR abs/1810.10308 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mimno</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Evaluation methods for topic models</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>