Temporal Evolution of the Migration-related Topics on Social Media? Yiyi Chen, Genet Asefa Gesese, Harald Sack, and Mehwish Alam 1 FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Germany 2 Karlsruhe Institute of Technology, Institute AIFB, Germany firstname.lastname@fiz-karlsruhe.de Abstract. This poster focuses on capturing the temporal evolution of migration-related topics on relevant tweets. It uses Dynamic Embedded Topic Model (DETM) as a learning algorithm to perform a quantitative and qualitative analysis of these emerging topics. TweetsKB is extended with the extracted Twitter dataset along with the results of DETM which considers temporality. These results are then further analyzed and visual- ized. It reveals that the trajectories of the migration-related topics are in agreement with historical events. The source codes are available online: https://bit.ly/3dN9ICB. 1 Introduction Social media has become one of the most widely used channels for people to exchange opinions about social events around the globe. It provides one of the most useful resources about social interactions and trending events on important topics such as migration, climate change, political elections, etc. Over the last decade, migration has become one of the most controversial topics in Europe. This poster analyses the Twitter data from the destination countries in Europe, to gain insight into migration-related events. The analysis of the temporal tra- jectory of the topics in migration-related tweets shows the shifts in the attention of people over time. For example, people’s interest in “Syrian refugees” peaked in 2016 in Europe, while later it decayed. With the emergence of COVID-19 in 2020, the attention shifted to “border control”. By analyzing the co-occurring topic words, the driv- ing factors could be identified. As shown in Fig. 2, from 2020 onwards the prob- ability trajectories of co-occurring words “COVID”, “migrant”, “border” have similar trends, which indicates the pandemic as a causal factor of debates about border control, and further affecting the migration flows in this period. Several efforts have been made to provide a Knowledge Base (KB) about tweets to make the data more accessible and usable for further research. One ? This work is a part of ITFLOWS project which has received funding from the EU H2020 research and innovation program under grant agreement No 882986. Copyright © 2021 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 2 Chen et al. such effort is TweetsKB [6], which is a KB containing 1.5 billion tweets span- ning through 5 years, including entity and sentiment annotations. MigrAnalyt- ics [1] is another such effort which uses TweetsKB as a starting point to ana- lyze migration-related tweets using entity-based approaches. While both provide time-series data, no attempt has been made to analyze the temporal evolution of migration-related topics, which would help us to identify the changing driving factors of migrations in a temporal dimension. This study completes Migra- tionsKB (MGKB) [3] with the tweets including topics evolving through time using advanced methods based on neural networks. 2 Temporal Evolution of Topics in Migrations Tweet Extraction. Keyword-based methods are used to extract tweets from Twitter. The words “immigration” and “refugee” are used as the seed words which are then enriched with top-50 most similar words with pre-trained Word2Vec and fastText embeddings. The final set of keywords are then manually verified. The collected tweets are from 11 destination countries, where most refugees in Europe are hosted. These countries are selected by ranking them according to the frequency of the asylum seekers obtained from Eurostat3 . These countries in- clude the United Kingdom, Germany, Spain, Poland, France, Sweden, Austria, Hungary, Switzerland, Netherlands, and Italy. The final number of extracted tweets spanning over 7 years (2013 - 2020) is 384891 which are then prepro- cessed by removing user mentions, reserved words (i.e., RT), emojis, smileys, URLs, stop words, punctuations, numerical tokens, HTML tags. Moreover, the tokens except hashtags are lemmatized. Finally, the tweets with at least two words are retained. Dynamic Topic Modeling. Topic modeling is used to extract hidden seman- tics in textual documents. The most widely used algorithm for topic modeling is Latent Dirichlet Allocation (LDA) [2]. However, it fails in the presence of large vocabularies. Embedded Topic Model (ETM) [5] aims to solve this problem by employing word embeddings. It defines each topic as a vector on the word em- bedding space, and then represents the per-topic proportion as joint information from words and topic embeddings. DETM [4] is developed to extend ETM, which uses a probabilistic time series to model the topics varying over time. Similar to ETM, the probability of each word under DETM is a categorical distribu- tion whose parameters depend on joint information from word embeddings and per-topic proportion, however, the topic proportions vary over time. In DETM, the word and the topic embeddings are trained in parallel on the extracted Twitter data with a time variable on a predefined number of topics. The data is split into the train (85%), validation (5%), and test (10%) (i.e., 306077, 18011, and 35766 tweets respectively) with a vocabulary size of 20865. DETM is optimized on a document completion task [7] and the best model is used to generate topics from the tweets. In [4], the authors choose 50 as the number 3 https://ec.europa.eu/eurostat Temporal Evolution of the Migration-related Topics on Social Media 3 of topics. For the sake of completion, the experiments are also conducted with 25 and 75 topics in this study. However, after applying the pretrained DETM on the extracted Twitter data, many redundant topics are found, i.e., not being classified to the tweets. Hence, the models with lower numbers of topics are applied. As shown in Table 1, only models trained with 5 and 10 topics have all the topics classified to the tweets, which are used for further analysis. Table 1: The Results of DETM with Different Number of Topics Nr. Topics 5 10 15 20 25 50 75 Val PPL 2938.2 2757.5 2736.8 2702.1 2698.4 2638.9 2741.7 Test PPL 2760.7 2706.1 2638.9 2594.8 2647.8 2637.8 2740.8 Topic Diversity 0.773 0.7545 0.6953 0.6598 0.6666 0.6951 0.6105 Topic Coherence 0.3724 0.2533 0.2453 0.2030 0.1913 0.0619 0.1299 Topic Quality 0.2878 0.1911 0.1706 0.1340 0.1275 0.0430 0.0793 Classified Nr. of Topics 5 10 12 14 14 32 37 The trained word embeddings from DETM represent the similarity between the words in the tweets. To select the tweets that are relevant to the topic of migration, the centroid of the word vectors of keywords is clustered with K- Means, and then the cosine similarity between the centroid and the trained topic proportions of tweets is calculated (i.e., the confidence score). The relevant tweets have a confidence score greater than 0. As shown in Fig. 1a, for both models, the scores are normally distributed, while the topic proportions of the tweets are trained under a Gaussian noise [4]. That means, the distance between the tweets and the centroid has the same distribution as the topic proportions. Therefore, “migration” is the central topic of the tweets. In Fig. 1b, the distribution of the tweets stays almost the same across time and across topics. However, in Fig. 1c, after filtering the non-relevant tweets, the tweets of the 5th topic are filtered out, while the tweets of the 8th topic are more evenly distributed across time. After the automatic selection, since the model trained with 10 topics provide richer topics across the time, the model is selected for further analysis. Fig. 1d shows the distribution of the non-relevant tweets per country. Fig. 2 shows the evolution of word probability across time for four different specific topics (i.e., the titles for the plots) learned by DETM. For each, a set of words whose probability shift aligns with historical events are presented. For example, for “Syria, Refugee, and EU”, the probability of the words shift in the same manner from 2014 to 2017, reflecting the event of a large amount of Syrian refugees entering the European Union, which peaked from 2015 to 2016. In 2018, the Brexit decisions were made, at the same time, the media outlets claimed a strong connection between Brexit and Trumpism, which is also reflected in “Brexit and Refugee”. The discussion of the border has a strong correlation with the emergence of COVID-19 since 2020, as shown in “Migrant and Border”. Extending TweetsKB. The tweets along with the assigned topics are stored in an extension of TweetsKB. Moreover, the words associated with each topic 4 Chen et al. Fig. 1: Tweets assigned with Topics (a) Distribution of Confidence Score for Models with 5 and 10 Topics (b) The Distribution of Tweets before and after Filtering for Model with 5 Topics (c) The Distribution of Tweets before and after Filtering for Model with 10 Topics (d) The Distribution of Non/Relevant Tweets per Country Temporal Evolution of the Migration-related Topics on Social Media 5 Fig. 2: Evolution of Word Probability Across Time for different Topics across time are also added using new classes and properties, please refer to the GitHub page for more details. This information can be queried using SPARQL. The following query shows the evolution of words of a given topic. SELECT ?TopicWords ?year WHERE{ ?topic sioc:id "7"; mgkb:topicOccur ?spTopic. ?spTopic dc:date ?year. schema:description ?TopicWords. }ORDER BY DESC(?year) 3 Discussion and Future Work This study focuses on capturing the evolution of topics in migrations-related tweets in the destination countries. It uses the time-aware topic modeling method, DETM, for achieving this goal. The results are then populated in extended TweetsKB where the temporal dimension is represented with the help of literals (date values) for further analysis. As a future study, an extensive manual verifi- cation on the chosen topics will be conducted, and the current RDF schema will be extended with RDF-Star. References 1. Alam, M., Gesese, G.A., Rezaie, Z., Sack, H.: Migranalytics: Entity-based analytics of migration tweets. CEUR workshop proceedings 2721, 74–78 (2020) 2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(null), 993–1022 (Mar 2003) 3. Chen, Y., Sack, H., Alam, M.: Migrationskb: A knowledge base of public attitudes towards migrations and their driving factors (2021) 4. Dieng, A.B., Ruiz, F.J.R., Blei, D.M.: The dynamic embedded topic model. CoRR abs/1907.05545 (2019), http://arxiv.org/abs/1907.05545 5. Dieng, A.B., Ruiz, F.J.R., Blei, D.M.: Topic modeling in embedding spaces (2019) 6. Fafalios, P., Iosifidis, V., Ntoutsi, E., Dietze, S.: Tweetskb: A public and large-scale RDF corpus of annotated tweets. CoRR abs/1810.10308 (2018) 7. Wallach, H.M., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for topic models. In: International Conference on Machine Learning (2009)