Introduction

Temporal Evolution of the Migration-related Topics on Social Media?

Yiyi Chen

Genet Asefa Gesese

Harald Sack

Mehwish Alam

0 FIZ Karlsruhe 1 Karlsruhe Institute of Technology, Institute AIFB , Germany 2 Leibniz Institute for Information Infrastructure , Germany

This poster focuses on capturing the temporal evolution of migration-related topics on relevant tweets. It uses Dynamic Embedded Topic Model (DETM) as a learning algorithm to perform a quantitative and qualitative analysis of these emerging topics. TweetsKB is extended with the extracted Twitter dataset along with the results of DETM which considers temporality. These results are then further analyzed and visualized. It reveals that the trajectories of the migration-related topics are in agreement with historical events. The source codes are available online: https://bit.ly/3dN9ICB.

Introduction

Social media has become one of the most widely used channels for people to exchange opinions about social events around the globe. It provides one of the most useful resources about social interactions and trending events on important topics such as migration, climate change, political elections, etc. Over the last decade, migration has become one of the most controversial topics in Europe. This poster analyses the Twitter data from the destination countries in Europe, to gain insight into migration-related events. The analysis of the temporal trajectory of the topics in migration-related tweets shows the shifts in the attention of people over time.

For example, people's interest in \Syrian refugees" peaked in 2016 in Europe, while later it decayed. With the emergence of COVID-19 in 2020, the attention shifted to \border control". By analyzing the co-occurring topic words, the driving factors could be identi ed. As shown in Fig. 2, from 2020 onwards the probability trajectories of co-occurring words \COVID", \migrant", \border" have similar trends, which indicates the pandemic as a causal factor of debates about border control, and further a ecting the migration ows in this period.

Several e orts have been made to provide a Knowledge Base (KB) about tweets to make the data more accessible and usable for further research. One ? This work is a part of ITFLOWS project which has received funding from the EU

H2020 research and innovation program under grant agreement No 882986. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). such e ort is TweetsKB [ 6 ], which is a KB containing 1.5 billion tweets spanning through 5 years, including entity and sentiment annotations. MigrAnalytics [ 1 ] is another such e ort which uses TweetsKB as a starting point to analyze migration-related tweets using entity-based approaches. While both provide time-series data, no attempt has been made to analyze the temporal evolution of migration-related topics, which would help us to identify the changing driving factors of migrations in a temporal dimension. This study completes MigrationsKB (MGKB) [ 3 ] with the tweets including topics evolving through time using advanced methods based on neural networks. 2

Temporal Evolution of Topics in Migrations

Tweet Extraction. Keyword-based methods are used to extract tweets from Twitter. The words \immigration" and \refugee" are used as the seed words which are then enriched with top-50 most similar words with pre-trained Word2Vec and fastText embeddings. The nal set of keywords are then manually veri ed. The collected tweets are from 11 destination countries, where most refugees in Europe are hosted. These countries are selected by ranking them according to the frequency of the asylum seekers obtained from Eurostat3. These countries include the United Kingdom, Germany, Spain, Poland, France, Sweden, Austria, Hungary, Switzerland, Netherlands, and Italy. The nal number of extracted tweets spanning over 7 years (2013 - 2020) is 384891 which are then preprocessed by removing user mentions, reserved words (i.e., RT), emojis, smileys, URLs, stop words, punctuations, numerical tokens, HTML tags. Moreover, the tokens except hashtags are lemmatized. Finally, the tweets with at least two words are retained.

Dynamic Topic Modeling. Topic modeling is used to extract hidden semantics in textual documents. The most widely used algorithm for topic modeling is Latent Dirichlet Allocation (LDA) [ 2 ]. However, it fails in the presence of large vocabularies. Embedded Topic Model (ETM) [ 5 ] aims to solve this problem by employing word embeddings. It de nes each topic as a vector on the word embedding space, and then represents the per-topic proportion as joint information from words and topic embeddings. DETM [ 4 ] is developed to extend ETM, which uses a probabilistic time series to model the topics varying over time. Similar to ETM, the probability of each word under DETM is a categorical distribution whose parameters depend on joint information from word embeddings and per-topic proportion, however, the topic proportions vary over time.

In DETM, the word and the topic embeddings are trained in parallel on the extracted Twitter data with a time variable on a prede ned number of topics. The data is split into the train (85%), validation (5%), and test (10%) (i.e., 306077, 18011, and 35766 tweets respectively) with a vocabulary size of 20865. DETM is optimized on a document completion task [ 7 ] and the best model is used to generate topics from the tweets. In [ 4 ], the authors choose 50 as the number 3 https://ec.europa.eu/eurostat

Temporal Evolution of the Migration-related Topics on Social Media of topics. For the sake of completion, the experiments are also conducted with 25 and 75 topics in this study. However, after applying the pretrained DETM on the extracted Twitter data, many redundant topics are found, i.e., not being classi ed to the tweets. Hence, the models with lower numbers of topics are applied. As shown in Table 1, only models trained with 5 and 10 topics have all the topics classi ed to the tweets, which are used for further analysis.

The trained word embeddings from DETM represent the similarity between the words in the tweets. To select the tweets that are relevant to the topic of migration, the centroid of the word vectors of keywords is clustered with KMeans, and then the cosine similarity between the centroid and the trained topic proportions of tweets is calculated (i.e., the con dence score). The relevant tweets have a con dence score greater than 0. As shown in Fig. 1a, for both models, the scores are normally distributed, while the topic proportions of the tweets are trained under a Gaussian noise [ 4 ]. That means, the distance between the tweets and the centroid has the same distribution as the topic proportions. Therefore, \migration" is the central topic of the tweets.

In Fig. 1b, the distribution of the tweets stays almost the same across time and across topics. However, in Fig. 1c, after ltering the non-relevant tweets, the tweets of the 5th topic are ltered out, while the tweets of the 8th topic are more evenly distributed across time. After the automatic selection, since the model trained with 10 topics provide richer topics across the time, the model is selected for further analysis. Fig. 1d shows the distribution of the non-relevant tweets per country.

Fig. 2 shows the evolution of word probability across time for four di erent speci c topics (i.e., the titles for the plots) learned by DETM. For each, a set of words whose probability shift aligns with historical events are presented. For example, for \Syria, Refugee, and EU", the probability of the words shift in the same manner from 2014 to 2017, re ecting the event of a large amount of Syrian refugees entering the European Union, which peaked from 2015 to 2016. In 2018, the Brexit decisions were made, at the same time, the media outlets claimed a strong connection between Brexit and Trumpism, which is also re ected in \Brexit and Refugee". The discussion of the border has a strong correlation with the emergence of COVID-19 since 2020, as shown in \Migrant and Border".

Extending TweetsKB. The tweets along with the assigned topics are stored in an extension of TweetsKB. Moreover, the words associated with each topic

Fig. 1: Tweets assigned with Topics (a) Distribution of Con dence Score for Models with 5 and 10 Topics (b) The Distribution of Tweets before and after Filtering for Model with 5 Topics (c) The Distribution of Tweets before and after Filtering for Model with 10 Topics

(d) The Distribution of Non/Relevant Tweets per Country across time are also added using new classes and properties, please refer to the GitHub page for more details. This information can be queried using SPARQL. The following query shows the evolution of words of a given topic. SELECT ?TopicWords ?year WHERE{ ?topic sioc:id "7"; mgkb:topicOccur ?spTopic. ?spTopic dc:date ?year. schema:description ?TopicWords. }ORDER BY DESC(?year) 3

Discussion and Future Work

This study focuses on capturing the evolution of topics in migrations-related tweets in the destination countries. It uses the time-aware topic modeling method, DETM, for achieving this goal. The results are then populated in extended TweetsKB where the temporal dimension is represented with the help of literals (date values) for further analysis. As a future study, an extensive manual veri cation on the chosen topics will be conducted, and the current RDF schema will be extended with RDF-Star.

1. Alam , M. , Gesese , G.A. , Rezaie , Z. , Sack , H.: Migranalytics: Entity-based analytics of migration tweets . CEUR workshop proceedings 2721 , 74{ 78 ( 2020 )

2. Blei , D.M. , Ng , A.Y. , Jordan , M.I. : Latent dirichlet allocation . J. Mach. Learn. Res . 3 ( null ), 993 {1022 (Mar 2003 )

3. Chen , Y. , Sack , H. , Alam , M. : Migrationskb: A knowledge base of public attitudes towards migrations and their driving factors ( 2021 )

4. Dieng , A.B. , Ruiz , F.J.R. , Blei , D.M.: The dynamic embedded topic model . CoRR abs/ 1907 .05545 ( 2019 ), http://arxiv.org/abs/ 1907 .05545

5. Dieng , A.B. , Ruiz , F.J.R. , Blei , D.M.:

Topic modeling in embedding spaces (

2019 )

6. Fafalios , P. , Iosi

dis

, V., Ntoutsi , E. , Dietze , S. : Tweetskb: A public and large-scale RDF corpus of annotated tweets . CoRR abs/ 1810 .10308 ( 2018 )

7. Wallach , H.M. , Murray , I. , Salakhutdinov , R. , Mimno , D. : Evaluation methods for topic models . In: International Conference on Machine Learning ( 2009 )