<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ross Deans Kristensen-McLachlan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rebecca M. M. Hicke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Márton Kardos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mette Thunø</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Humanities Computing, Aarhus University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Cornell University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Global Studies, Aarhus University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Department of Linguistics, Cognitive Science, and Semiotics, Aarhus University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Does the People's Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.</p>
      </abstract>
      <kwd-group>
        <kwd>novelty</kwd>
        <kwd>contextual topic models</kwd>
        <kwd>Chinese</kwd>
        <kwd>information dynamics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Much digital ink is spilled on elections in Western media as the various electorates
determine their preferences before elections and digest the fallout afterwards. Moreover, a
significant part of this media coverage is fundamentally persuasive, aiming to convince voters
to bet on the candidate who most closely aligns with the social and economic ideology of the
media outlets and their owners [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Likewise, coverage of these elections is not limited to
European media institutions, with media outlets around the world updating their readership
on how these elections impact them.
      </p>
      <p>
        In this context, one particular type of media stands out as especially interesting: ethnic
Chinese media targeting diaspora communities in Europe, a group which by some estimates
comprises around 1.5-3 million individuals. These media outlets are potentially invaluable
sources for understanding how the Chinese government and the Chinese Communist Party
(CCP) attempt to influence the diaspora. Furthermore, studying these outlets potentially
provides unique insights into how China views itself in relation to the West by showing how the
PRC presents itself to its diaspora groups. A growing body of literature has already begun
to address these questions in the context of social media [28, 29] or in terms of digital
infrastructure more generally [
        <xref ref-type="bibr" rid="ref10 ref11 ref16">10, 11</xref>
        ]. In ongoing research, our aim is to assess whether Chinese
diaspora news sources intend to impact opinions on elections in the West during 2024. We
attempt to understand the control of information flow in Chinese diaspora media and how this
control is used to set specific agendas during electoral periods: promoting certain political
parties or individual candidates, polarizing citizens, and attacking or promoting specific political
positions.
      </p>
      <p>To pursue this research, we design a pipeline for analyzing large amounts of
Chinese-language news data. First, we introduce KeyNMF, a novel approach to creating
context-sensitive topic models via transformer-based encoder models. KeyNMF can be trivially
applied across different languages and in data-scarce environments, and is shown here to create
coherent, human-interpretable outputs when working with Chinese language data. We then
integrate KeyNMF with existing techniques for describing the information dynamics of
complex systems which measure the novelty and resonance of information present in a system over
time. We use this pipeline to perform preliminary analysis on our dataset of Chinese diaspora
media, finding clear trends in the novelty and resonance signals which correlate with
significant political events. The results presented are thus intended to be both a proof of concept and
a stepping stone towards more meaningful understanding of the dynamics underlying Chinese
diaspora media.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Information Dynamics</title>
        <p>The study of information dynamics in complex cultural systems has been a central aspect of
research in computational humanities and cultural analytics in recent years. One of the most
promising approaches to this problem was introduced in [3], which studied the shifting
debates which took place during the French Revolution. In this approach, divergence in content
between different time slices can be calculated using information-theoretic measures. These
measures can then be used to quantify two interrelated values: the novelty of the system, or
how much the new time slice diverges from preceding time slices; and the resonance of this
information, which describes how information persists over time.</p>
        <p>
          Novelty-resonance patterns have been studied in a number of different discourse domains.
[
          <xref ref-type="bibr" rid="ref22">21</xref>
          ] demonstrate their usefulness in identifying so-called trend reservoirs on Reddit. Similar
interaction patterns between novelty and resonance have been successfully employed to study
the manner in which online news media responded to catastrophic events [
          <xref ref-type="bibr" rid="ref20 ref21">18, 20, 19</xref>
          ]. In [31],
the same fundamental method of analysis demonstrates that novelty-resonance patterns clearly
track major social and historical events in the 20th century, using data taken from the front
page of Dutch newspapers.
        </p>
        <p>Calculating these underlying dynamics requires the creation of some kind of numerical
representation of the data. Specifically, the difference between individual windows is computed
by finding the windowed relative entropy, in this case calculated using Jensen-Shannon
Divergence (JSD). Since JSD computes the distance between probability distributions, the numerical
representations of the data are required to take that form. In [2], this was achieved by
calculating the probabilities of a pre-trained, BERT-based emotion classification model, where the
predicted probabilities for each label created a distribution over emotions for each document.
However, for most purposes, novelty and resonance are calculated based on distributions
generated by a probabilistic topic model.</p>
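        <p>As an illustration of the divergence measure underlying these calculations, the Jensen-Shannon divergence between two topic distributions can be computed with SciPy. This is a minimal sketch with invented toy distributions, not code from the study:</p>

```python
# Minimal illustration of comparing two topic distributions with Jensen-Shannon
# divergence; the two distributions below are invented toy values.
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.7, 0.2, 0.1])   # topic distribution in one time slice
q = np.array([0.1, 0.2, 0.7])   # topic distribution in another time slice

# SciPy returns the JS *distance* (the square root of the divergence),
# so the divergence itself is the squared value.
jsd = jensenshannon(p, q, base=2) ** 2
print(round(jsd, 4))
```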
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Vanilla LDA</title>
        <p>Typically, novelty and resonance are calculated from topic probability distributions extracted
by Latent Dirichlet Allocation (LDA) [9, 7]. Topic distributions in documents are a natural
choice for information dynamics, as they are immediately usable with entropy-based measures.
LDA is a generative bag-of-words model, which assumes that a document contains a mixture
of topics and all words in the document are drawn from this mixture distribution.</p>
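        <p>As a brief illustration of why such models suit entropy-based measures, a fitted LDA model yields one probability distribution over topics per document. The snippet below is a toy sketch using scikit-learn, not the models used in this study; the corpus is invented:</p>

```python
# Toy sketch: LDA produces per-document topic distributions (rows sum to 1),
# which is exactly the form required by entropy-based information dynamics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "election vote party parliament",
    "beijing visit diplomacy china",
    "election party candidate vote vote",
]
counts = CountVectorizer().fit_transform(docs)   # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

theta = lda.transform(counts)   # document-topic distributions, one row per document
```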
        <p>
          However, LDA has a number of well-known shortcomings. Documents have to be heavily
pre-processed for optimal results; otherwise, the topic descriptions produced by the model are
often contaminated by noise and stop words [16]. In addition, since LDA makes the
bag-of-words assumption, it cannot utilize contextual and syntactic information, nor general
properties of natural language learned from outside sources. Finally, LDA is sensitive to
hyperparameter choices and Wallach, Mimno, and McCallum [30] demonstrate that using symmetric
Dirichlet priors, which is the case in canonical implementations [
          <xref ref-type="bibr" rid="ref23 ref5">5, 22</xref>
          ] and the majority of
academic studies, can lead to sub-optimal performance.
        </p>
        <p>
          There have also been challenges to the generalizability of LDA from the perspective of
Chinese NLP, as the primary structural and semantic unit of Chinese is the character rather than
the word [
          <xref ref-type="bibr" rid="ref25 ref27">32, 24</xref>
          ]. While these concerns might be overstated, working with Chinese language
data causes specific challenges in terms of tokenization and semantics which directly impact
the efficacy of traditional LDA approaches to topic modelling.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Alternatives to LDA</title>
        <p>A major shortcoming of LDA when trying to model change over time is that topics are
calculated over all documents, essentially flattening any temporal aspect of the data. This is
undesirable, since topics themselves naturally evolve over time, meaning that LDA may not reflect
the true dynamics of a system. This issue is partly rectified by dynamic topic models [8]
which account for temporal changes in topics with a state-space model. However, Dynamic
LDA models are even more parameter-rich than the vanilla implementation and thus amplify
its limitations.</p>
        <p>
          Recently, contemporary topic models have shown that it is possible to utilize embeddings
from sentence transformers [27] to infuse contextual information into topic models and
to allow for transfer learning [
          <xref ref-type="bibr" rid="ref1 ref5 ref6">5, 6, 14, 1, 16</xref>
          ]. This contextual information can lead to more
coherent and semantically interpretable topics. In addition, since these models draw on existing
pre-trained language models, they do not require training a generative model from scratch.
This means that it is possible to train topic models in data-scarce contexts where traditional
LDA might perform poorly.
        </p>
        <p>
          Among the most popular of these contemporary models is BERTopic [
          <xref ref-type="bibr" rid="ref2">14</xref>
          ], which also has
dynamic modelling capabilities. In this model, topic-term importances are estimated post-hoc on
pre-defined time slices based on one underlying topic model. However, as with LDA, BERTopic
is sensitive to pre-processing [16]. Additionally, because BERTopic is a clustering topic model,
documents are only assigned a single topic label. This renders the model impractical in settings
where documents are expected to contain multiple topics and means that BERTopic is not
suitable for calculating novelty and resonance, since the entropy calculations assume probability
distributions over documents.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. KeyNMF</title>
      <p>
        We propose KeyNMF, a novel topic modelling approach that utilizes neural text embeddings.
KeyNMF builds on the reliability, stability [4], scalability [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ], and interpretability of
Nonnegative Matrix Factorization (NMF) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], while mitigating its sensitivity to pre-processing
and making use of contextual information in texts. This is achieved by: 1) computing keyword
importances from documents with contextual embeddings (similar to KeyBERT [
          <xref ref-type="bibr" rid="ref5">15</xref>
        ]); and 2)
decomposing those importances with NMF.
      </p>
      <p>We release an implementation of KeyNMF as part of the Turftopic Python package.1</p>
      <sec id="sec-3-1">
        <title>3.1. Model Description</title>
        <p>KeyNMF operationalizes topic extraction as the following steps:
1. For each document d:
a) Let e_d be the document’s embedding produced with an encoder model.
b) Let e_t be the word embedding of a word t produced with the same encoder model.
c) Let K_d be the set of N keywords in d with the highest cosine similarity to e_d:
K_d = arg max_{K*} Σ_{t ∈ K*} sim(e_d, e_t), where |K_d| = N and t ∈ d.
2. Arrange the keyword similarities into a non-negative keyword matrix M. Let M_dt be
the importance of keyword t in document d:
M_dt = sim(e_d, e_t) if t ∈ K_d and sim(e_d, e_t) &gt; 0; otherwise M_dt = 0.
3. Decompose M with non-negative matrix factorization: M ≈ WH, where W is the
document-topic matrix, and H is the topic-term matrix. This is achieved with coordinate
descent, minimizing the square loss L(W, H) = ||M − WH||².</p>
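        <p>The steps above can be sketched as follows. This is an illustrative toy implementation, not the Turftopic code: random vectors stand in for encoder embeddings, and the vocabulary is invented.</p>

```python
# Toy sketch of KeyNMF: keyword extraction by embedding similarity, then NMF.
# Random vectors stand in for sentence-transformer embeddings.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
vocab = ["election", "vote", "party", "china", "beijing", "visit"]
word_emb = rng.random((len(vocab), 8))   # stand-in word embeddings e_t
doc_emb = rng.random((3, 8))             # stand-in document embeddings e_d
N = 2                                    # keywords kept per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Steps 1-2: build the non-negative keyword matrix M
M = np.zeros((len(doc_emb), len(vocab)))
for d, e_d in enumerate(doc_emb):
    sims = np.array([cosine(e_d, e_t) for e_t in word_emb])
    top = np.argsort(sims)[-N:]              # N keywords most similar to the document
    M[d, top] = np.clip(sims[top], 0, None)  # negative similarities are zeroed

# Step 3: decompose M ≈ WH (scikit-learn's NMF uses coordinate descent
# on the squared Frobenius loss by default)
model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(M)   # document-topic matrix
H = model.components_        # topic-term matrix
```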
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dynamic KeyNMF</title>
        <p>KeyNMF can be used for modelling topics’ evolution in a corpus over time. This is done by
first computing a global model over the entire corpus, then calculating time-specific topic-term
importances in predefined time slices. Specifically:
1. Compute the keyword matrix M for the whole corpus.
2. Decompose M with non-negative matrix factorization: M ≈ WH.
3. For each time slice t:
a) Let M_t be the rows of M and W_t the rows of W corresponding to the documents in time slice t.
b) Obtain the topic-term matrix H_t for t with NMF while fixing W_t: H_t = arg min_{H*} ||M_t − W_t H*||².
c) The temporal importance of topic z is then s_tz = Σ_d (W_t)_dz, where all d are
documents in time slice t. We can obtain pseudo-topic distributions in the time slices
by L1-normalizing the temporal importances: ŝ_tz = s_tz / Σ_z s_tz.</p>
        <p>Since NMF is not a probabilistic model, we use temporal pseudo-probabilities as a proxy for
topic distributions.</p>
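        <p>The slice-wise refitting in step 3 can be sketched as follows. This is a toy illustration, not the released implementation: random matrices stand in for the keyword and document-topic matrices, and the fixed-W least-squares problem is solved column-wise with non-negative least squares.</p>

```python
# Toy sketch of dynamic KeyNMF: per time slice, refit the topic-term matrix H_t
# while holding the global document-topic rows W_t fixed, then derive
# L1-normalized temporal topic importances.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
M = rng.random((6, 5))   # keyword matrix for the whole corpus (docs x terms)
W = rng.random((6, 2))   # document-topic matrix from the corpus-level NMF
slices = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # document indices per slice

for t, idx in enumerate(slices):
    M_t, W_t = M[idx], W[idx]
    # H_t = argmin_H ||M_t - W_t H||^2 with H >= 0, solved column by column
    H_t = np.column_stack([nnls(W_t, M_t[:, j])[0] for j in range(M_t.shape[1])])
    # temporal topic importances: sum W_t over the slice's documents, L1-normalize
    s_t = W_t.sum(axis=0)
    s_hat = s_t / s_t.sum()
```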
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Performance</title>
        <p>
          To demonstrate KeyNMF’s effectiveness as a topic model, we evaluate its performance using
the topic-benchmark Python package and the paraphrase-multilingual-MiniLM-L12-v2 embedding
model (https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).
15 keywords are extracted for each document. Our evaluation procedure is based on that of
Kardos, Kostkan, Vermillet, Nielbo, Enevoldsen, and Rocca [
          <xref ref-type="bibr" rid="ref6">16</xref>
          ], but, since our intended use case is Chinese news data, we ran the benchmark using the
same corpora and pipeline as in our investigations (see Sections 4 and 5). Additionally, we
utilized paraphrase-multilingual-MiniLM for measuring external word embedding coherence,
instead of an English Word2Vec model. (This gives Top2Vec an unfair advantage on this metric,
as it selects descriptive words based on the same criteria as the metric; Top2Vec’s scores should
thus be interpreted with caution.)
        </p>
        <p>[Table 1: KeyNMF’s performance on Chinese news data against a number of baselines. Topic
descriptions were evaluated on diversity, internal, and external word embedding coherence.]</p>
        <p>KeyNMF was only outperformed by Top2Vec on most corpora, a model which explicitly selects
words based on their proximity in semantic space (see Table 1). KeyNMF represents a drastic
improvement over classical topic models, outperforming both NMF and LDA significantly,
indicating that the contextual information infused into the model enhances its performance in a
meaningful way.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Sensitivity to Number of Keywords</title>
          <p>We additionally test whether the number of keywords extracted from a text influences the
model’s performance on different corpora, which allows us to determine KeyNMF’s robustness
to hyperparameter choices. We used the same news sources, pipeline, and quantitative metrics
for evaluating this property of the model as for previous evaluations and analyses. The number
of keywords was varied from 5 to 100 with a step size of 5 (see Figure 1).</p>
          <p>We observed that performance was relatively stable regardless of the number of keywords,
and converged rather quickly. Only minimal fluctuations are observable with N &gt; 25 on most
corpora. However, on Xinozhou and Yidali-Huarenjie, lower values of N (5-15) resulted in
higher coherence scores. We thus deem 15 keywords a balanced choice of N for further
investigations.</p>
          <p>[Figure 2: Total and unique articles collected by site.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Data</title>
      <p>[Figure 3: Number of new articles at each time point for each site (Xinouzhou, Oushinet,
Yidali Huarenjie, Chinanews, Ihuawen).]</p>
      <p>[…] the amount of boilerplate text (e.g. bylines and publication dates) included in the extracted
texts; although it is impossible to remove all such text from our dataset, a hand analysis of ten
random articles from each news site indicates that the amount of ‘junk’ text included in the
final dataset is minimal.</p>
      <p>The total and unique number of articles collected from each site are reported in Figure 2.
It is clear that different sites follow different publication patterns. To further validate this,
we examine the number of ‘new’ articles at each time point for each source, or the number
of articles that were not included in the last scrape (Figure 3). We see that some sites, like
Xinouzhou and Yidali Huarenjie, frequently refresh the articles displayed on their main pages,
leading to a larger number of unique articles. In contrast, sites like Ihuawen appear to keep
several articles on the main pages for a long time, meaning that they display a very small
number of unique articles overall. These differences likely affect the patterns we see in the
information systems for each source.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Design</title>
      <p>
        Extracted article texts are embedded with a multilingual transformer-based model [
        <xref ref-type="bibr" rid="ref6">26</xref>
        ] (paraphrase-multilingual-MiniLM-L12-v2) using the Sentence Transformers library
(https://sbert.net). The embedding is done entirely on a 64-core CPU with 384GB RAM. Each
document is embedded once for each time it appears in the dataset. In total, embedding all
the documents takes ∼2 hours. The maximum sequence length of this embedding model is 128
tokens. Thus, any article longer than 128 tokens is truncated and information from later in the
piece is not included in the embedding. Although this is a limitation, we do not consider it
prohibitive, as previous research has shown that the bulk of the content in a news article is
presented at the very beginning, a widely-practiced professional standard for journalistic
writing known as the inverted pyramid [
        <xref ref-type="bibr" rid="ref24">23</xref>
        ].
      </p>
      <p>Since our primary interest is understanding the evolution of information dynamics in each
news site over time, we use Dynamic KeyNMF to find topic proportions for each timeslice.
For keyword extraction, we utilize the jieba tokenizer and remove stop-words present in an
authoritative list (https://github.com/stopwords-iso/stopwords-zh/blob/master/stopwords-zh.txt),
with the retained tokens then encoded using the same multilingual model as
was used on the documents [26]. We fit multiple models with 10, 25, and 50 topics respectively
in order to investigate topical dynamics at multiple levels of granularity. Separate models are fit
for each news site. The plotted topics over time, top keywords for each topic at each timeslice,
and topic distributions at each timeslice are extracted from each model and saved for further
analysis.</p>
      <p>
        We then use the topic pseudo-distributions to measure the novelty and resonance signals for
each news site and, following [
        <xref ref-type="bibr" rid="ref21">20</xref>
        ] and [2], use windowed relative entropy with Jensen-Shannon
divergence to calculate both metrics. For a window of size w, the novelty at time point t is the
mean entropy of the topic pseudo-distribution at t (ŝ_t) and the w previous pseudo-distributions.
The transience at time point t is the mean entropy of the topic pseudo-distribution at t and the
w subsequent pseudo-distributions. Then, the resonance of a time point is the novelty at that
point minus the transience. We use a window of size 12 when calculating both signals, which
is equivalent to three days of data.
      </p>
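      <p>Under the definitions above, the windowed novelty, transience, and resonance signals can be sketched as follows. This is an illustrative reimplementation, not the released analysis code: the pseudo-distributions are random toy values and the window is small for demonstration.</p>

```python
# Sketch of windowed novelty / transience / resonance over topic
# pseudo-distributions, using squared Jensen-Shannon distance as the divergence.
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
dists = rng.dirichlet(np.ones(10), size=30)  # one pseudo-distribution per time point
w = 3                                        # window size (the paper uses 12)

def novelty_resonance(dists, w):
    T = len(dists)
    novelty = np.full(T, np.nan)
    transience = np.full(T, np.nan)
    for t in range(w, T - w):
        # mean divergence from the w preceding distributions
        novelty[t] = np.mean(
            [jensenshannon(dists[t], dists[t - k]) ** 2 for k in range(1, w + 1)])
        # mean divergence from the w subsequent distributions
        transience[t] = np.mean(
            [jensenshannon(dists[t], dists[t + k]) ** 2 for k in range(1, w + 1)])
    resonance = novelty - transience   # novelty minus transience at each point
    return novelty, transience, resonance

novelty, transience, resonance = novelty_resonance(dists, w)
```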
      <p>
        We apply nonlinear adaptive filtering to smooth the extracted novelty and resonance, again
following [
        <xref ref-type="bibr" rid="ref21">20</xref>
        ] and [2]. This removes noise from the signals by calculating the value at a given
time point relative to the surrounding time points. We use a span of 56, the same as [2], for
smoothing. The code we use for calculating novelty and resonance is adapted from that
released alongside [2] and [
        <xref ref-type="bibr" rid="ref21">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>We find clear trends in the novelty and resonance signals that correlate to significant events
in the EU during the period studied: Xi Jinping’s European Tour (May 5-10), Putin’s state visit
to China (May 16-17), and the EU parliamentary elections (June 6-9). Our analysis focuses on
the novelty and resonance trends extracted from the KeyNMF models with ten topics as these
provide the clearest signals. The results for 25 and 50 topics are included in Appendix C.1.
We additionally focus our in-depth discussion of the results on the two largest news sources,
Xinouzhou and Oushinet, for this preliminary validation of the pipeline.</p>
      <p>We see spikes in novelty of varying strengths for both Xinouzhou and Oushinet during Xi
Jinping’s European tour (Figure 4). There are also corresponding dips in resonance before his
tour for both sites, followed by increases in resonance during the tour. This indicates that novel
information is introduced to the site ecosystems during the tour which replaces previous topics
of interest, and which persists in the system for some time.</p>
      <p>One of the most productive aspects of Dynamic KeyNMF is that it allows us to study topic
fluctuations over time. Thus, we explore which topical shifts contribute to changes in the
novelty and resonance signals. For example, on Oushinet, the time period during Xi Jinping’s</p>
      <p>European tour is associated with high pseudo-probabilities for a topic defined by the keywords
Paris, France and state visit and a topic defined by President, China, and Xi Jinping (Appendix
C.2, Figure 9). Towards the end of the tour, a topic on diplomacy and bilateral relations between
China and France also gains prominence. For Xinouzhou, this time period contains a peak in
the pseudo-probabilities for two topics on Hungary and Chinese relations with Hungary, one
of the locations on the tour.</p>
      <p>Similarly, there is a noticeable spike in the novelty and resonance for Oushinet directly before
Putin’s state visit to China. This period is marked by relatively high pseudo-probabilities for a
topic characterized by the terms China, Beijing, Chinese, and Chinese News Service and a topic
with the keywords Russia, Ukraine, Putin, and Moscow (Appendix C.2, Figure 7).</p>
      <p>Most significantly for this study, there are fluctuations in novelty and resonance for both
sites around the EU parliamentary elections. Specifically, there are peaks in the novelty and
resonance signals for Xinouzhou and Oushinet before and after the elections, with troughs
throughout much of the election period. We hypothesize that these trends reflect a focus on
election-related news which begins in early June and continues through the elections and then
an introduction of new topics after their end. Again examining the topic distributions, we
see that for Oushinet the period before and during the election is marked by high
pseudo-probabilities for two topics directly related to the parliamentary elections, one topic
surrounding the Spanish prime minister, and two on Russia and Ukraine and the Israel-Palestine war
(Appendix C.2, Figure 8). Interestingly, pseudo-probabilities for the topic most directly
focused on the elections continued to grow even after the election, suggesting that Oushinet was
still discussing the election results during this time. Similarly, for Xinouzhou, three topics
focused on the UK elections, Europe broadly, and the Spanish prime minister were comparatively
prominent towards the end of May and beginning of June.</p>
      <p>Overall, we find that this pipeline allows us to effectively locate changes in news ecosystems,
correlate these changes to political and cultural events of interest, and further explore possible
reasons for these changes via topic models. It reveals differences in media responses both
between events and between sites, while also demonstrating the similarities in sites’ news
ecosystems, such as the increased discussion of the Spanish prime minister on both Xinouzhou
and Oushinet before the EU parliamentary elections. We believe that the combination of the
novelty and resonance metrics with the novel KeyNMF topic model will permit further in-depth
analysis of these media sites and facilitate research on other Chinese-language domains.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we present a pipeline designed to facilitate research on the underlying
information dynamics of Chinese diaspora media published in Europe. This pipeline combines
existing information-theoretic methods that model how new information enters and persists in
systems with a novel topic model, KeyNMF. KeyNMF overcomes some of the weaknesses of
previous traditional and contextual topic models, demonstrating high performance on standard
benchmarks. We validate this pipeline through preliminary experimentation on our dataset of
Chinese diaspora media, finding that it reveals informational trends that correlate with major,
newsworthy events in European politics and allows for further analysis of the topical changes
that cause those trends. While further qualitative research is required to fully understand these
dynamics, we believe that we have presented a major step forward in terms of context-sensitive
and interpretable topic modelling and information dynamics which can generalize to
multilingual and data-scarce environments.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Part of the computation done for this project was performed on the UCloud interactive HPC
system, which is managed by the eScience Center at the University of Southern Denmark.</p>
    </sec>
    <sec id="sec-9">
      <title>A. News Site Subpages</title>
      <p>The subpages scraped for each news site are listed below:
• Xinouzhou: France, Italy, Spain, UK, Germany, Hungary, International
• Ihuawen: News, Comments &amp; Opinions
• Oushinet: Europe, Europe: Germany, Europe: Central and Eastern Europe, Europe:
Italy, Europe: Spain, Europe: Other, France, Europe and China, Overseas Chinese
community, China, International, Opinion on public affairs
• Chinanews: Nordic headlines, China news, Mutual learning among civilizations,
Overseas Chinese community, Nordic Commercial Bridge, Overseas thoughts
• Yidali Huarenjie: ∅</p>
    </sec>
    <sec id="sec-10">
      <title>B. NPMI Coherence</title>
      <p>Since NPMI Coherence has historical significance in topic modeling literature, we also
evaluated topic descriptions with this metric. Due to theoretical and practical limitations [16],
however, we do not consider NPMI Coherence a good metric for evaluating topic models. For
the sake of completeness, we report NPMI scores in Table 2.</p>
      <p>[Table 2: NPMI coherence of different topic models (including BERTopic and CombinedTM)
on the studied corpora (including oushinet and xinozhou).]</p>
    </sec>
    <sec id="sec-11">
      <title>C. Additional Experimental Results</title>
      <sec id="sec-11-1">
        <title>C.1. Novelty and Resonance Ablations</title>
        <p>[Figure panels omitted: novelty and resonance time series for each news site.]</p>
        <p>Figure 6: The novelty and resonance plots for each news site from KeyNMF with 50 topics. The three
shaded areas represent Xi Jinping’s European tour (May 5-10, 2024), Putin’s state visit to China (May
16-17, 2024), and the EU parliamentary elections (June 6-9, 2024). Note that the y-axis ranges differ for
each chart.</p>
      </sec>
      <sec id="sec-11-2">
        <title>C.2. Topic Distributions Over Time</title>
        <p>Oushinet: China, Beijing, Chinese, China News Service; Oushinet: Russia, Ukraine,
Putin, Moscow; Oushinet: elections, political parties, voting, European Parliament;
Oushinet: Spain, prime minister, Madrid, resign; Oushinet: Israel, Gaza, Palestine,
Palestinians; Xinouzhou: elections, UK, political parties, politics; Xinouzhou: Europe,
Germany, France, Paris; Xinouzhou: Spain, Prime Minister, Madrid, Spanish government.</p>
        <p>Figure 8: The distributions over time for eight topics with high pseudo-probabilities
around the EU parliamentary elections. These topics are generated by the 10-topic KeyNMF
models for Oushinet and Xinouzhou. Note that the y-axis scale differs for each subplot.</p>
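        <p>Topic-over-time curves like these can be produced by grouping per-document topic weights by publication date and averaging. The sketch below is one plausible aggregation (the paper's exact binning and smoothing may differ, and the function name is ours):</p>

```python
from collections import defaultdict

import numpy as np

def topic_over_time(dates, doc_topic, topic_idx):
    """Daily mean pseudo-probability of one topic.

    `dates` are ISO date strings aligned with the rows of the
    document-by-topic matrix `doc_topic`; returns the sorted dates
    and the mean weight of topic `topic_idx` on each date.
    """
    weights = np.asarray(doc_topic, dtype=float)[:, topic_idx]
    by_day = defaultdict(list)
    for day, w in zip(dates, weights):
        by_day[day].append(w)
    days = sorted(by_day)
    return days, [float(np.mean(by_day[d])) for d in days]
```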
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Angelov. Top2Vec: Distributed Representations of Topics. 2020. arXiv: 2008.09470 [cs.CL].</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. B. Baglini, S. M. Østergaard, S. N. Larsen, and K. L. Nielbo. “Emodynamics: Detecting and Characterizing Pandemic Sentiment Change Points on Danish Twitter”. In: Proceedings of the Fourth Conference on Computational Humanities Research, CHR 2022. Antwerp, Belgium, 2022, pp. 162-176.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. T. J. Barron, J. Huang, R. L. Spang, and S. DeDeo. “Individuals, Institutions, and Innovation in the Debates of the French Revolution”. In: Proceedings of the National Academy of Sciences 115.18 (2018), pp. 4607-4612. doi: 10.1073/pnas.1717729115.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Belford, B. Mac Namee, and D. Greene. “Stability of Topic Modeling via Matrix Factorization”. In: Expert Systems With Applications 91.1 (2018), pp. 159-169. doi: 10.1016/j.eswa.2017.08.047.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] F. Bianchi, S. Terragni, and D. Hovy. “Pre-Training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Online, 2021, pp. 759-766. doi: 10.18653/v1/2021.acl-short.96.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] F. Bianchi, S. Terragni, D. Hovy, D. Nozza, and E. Fersini. “Cross-lingual Contextualized Topic Models with Zero-shot Learning”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online, 2021, pp. 1676-1683. doi: 10.18653/v1/2021.eacl-main.143.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. M. Blei. “Probabilistic Topic Models”. In: Communications of the ACM 55.4 (2012), pp. 77-84. doi: 10.1145/2133806.2133826.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] D. M. Blei and J. D. Lafferty. “Dynamic Topic Models”. In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, Pennsylvania, USA, 2006, pp. 113-120. doi: 10.1145/1143844.1143859.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. “Latent Dirichlet Allocation”. In: Journal of Machine Learning Research 3.1 (2003), pp. 993-1022. doi: 10.5555/944919.944937.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] V. Brussee. “Authoritarian Design: How the Digital Architecture on China’s Sina Weibo Facilitates Information Control”. In: Asiascape: Digital Asia 9.3 (2022), pp. 207-241. doi: 10.1163/22142312-bja10033.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. Chan and C. Alden. “&lt;Redirecting&gt; the Diaspora: China’s United Front Work and the Hyperlink Networks of Diasporic Chinese Websites in Cyberspace”. In: Political Research Exchange 5.1 (2023), pp. 1-21. doi: 10.1080/2474736x.2023.2179409.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Cichocki and A.-H. Phan. “Fast Local Algorithms for Large Scale Nonnegative Matrix and Tensor Factorizations”. In: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E92.A.3 (2009), pp. 708-721. doi: 10.1587/transfun.E92.A.708.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] K. Gatterman, T. M. Meyer, and K. Wurzer. “Who Won the Election? Explaining News Coverage of Election Results in Multi-Party Systems”. In: European Journal of Political Research 61.4 (2022), pp. 857-877. doi: 10.1111/1475-6765.12498.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] M. Grootendorst. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. 2022. doi: 10.48550/arXiv.2203.05794. arXiv: 2203.05794 [cs.CL].</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          .
          <article-title>KeyBERT: Minimal Keyword Extraction with BERT</article-title>
          .
          <source>Version v0.3.0</source>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>doi: 10</source>
          .5281/zenodo.4461265.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[16] M. Kardos, J. Kostkan, A.-Q. Vermillet, K. Nielbo, K. Enevoldsen, and R. Rocca. S³ - Semantic Signal Separation. 2024. doi: 10.48550/arXiv.2406.09556. arXiv: 2406.09556 [cs.LG].</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[17] A. Lefèvre, F. Bach, and C. Févotte. “Online Algorithms for Nonnegative Matrix Factorization with the Itakura-Saito Divergence”. In: 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, USA, 2011, pp. 313-316. doi: 10.1109/aspaa.2011.6082314.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[18] K. L. Nielbo, R. B. Baglini, P. B. Vahlstrup, K. C. Enevoldsen, A. Bechmann, and A. Roepstorff. “News Information Decoupling: An Information Signature of Catastrophes in Legacy News Media”. In: Proceedings of the 2020 European Association for Digital Humanities Conference. Krasnoyarsk, Russia, 2021, pp. 1-8. doi: 10.48550/arXiv.2101.02956.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[19] K. L. Nielbo, K. Enevoldsen, R. Baglini, E. Fano, A. Roepstorff, and J. Gao. “Pandemic News Information Uncertainty - News Dynamics Mirror Differential Response Strategies to COVID-19”. In: Plos One 18.1 (2023), e0278098. doi: 10.1371/journal.pone.0278098.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[20] K. L. Nielbo, F. Haestrup, K. C. Enevoldsen, P. B. Vahlstrup, R. B. Baglini, and A. Roepstorff. When No News is Bad News - Detection of Negative Events from News Media Content. 2021. doi: 10.48550/arXiv.2102.06505. arXiv: 2102.06505 [cs.CY].</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[21] K. L. Nielbo, P. B. Vahlstrup, A. Bechmann, and J. Gao. “Trend Reservoir Detection: Minimal Persistence and Resonant Behavior of Trends in Social Media”. In: Proceedings of the Workshop on Computational Humanities Research (CHR 2020). Amsterdam, the Netherlands, 2020, pp. 290-297. doi: 10.48550/arXiv.2109.08589.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. “Scikit-Learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12.1 (2011), pp. 2825-2830.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[23] H. Pöttker. “News and Its Communicative Quality: the Inverted Pyramid - When and Why Did It Appear?” In: Journalism Studies 4.4 (2003), pp. 501-511. doi: 10.1080/1461670032000136596.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[24] Z. Qin, Y. Cong, and T. Wan. “Topic modeling of Chinese language beyond a bag-of-words”. In: Computer Speech &amp; Language 40 (2016), pp. 60-78. doi: 10.1016/j.csl.2016.03.004.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[25] R. Řehůřek and P. Sojka. “Software Framework for Topic Modelling with Large Corpora”. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta, 2010, pp. 45-50.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[26] N. Reimers and I. Gurevych. “Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online, 2020, pp. 4512-4525. doi: 10.48550/arXiv.2004.09813.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[27] N. Reimers and I. Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Hong Kong, China, 2019, pp. 3982-3992. doi: 10.18653/v1/D19-1410.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[28] M. Schliebs, H. Bailey, J. Bright, and P. N. Howard. China’s Public Diplomacy Operations: Understanding Engagement and Inauthentic Amplification of PRC Diplomats on Facebook and Twitter. Tech. rep. Oxford, UK: Programme on Democracy &amp; Technology, 2021.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[29] M. Thunø and K. L. Nielbo. “The Initial Digitalization of Chinese Diplomacy (2019-2021): Establishing Global Communication Networks on Twitter”. In: Journal of Contemporary China 33.146 (2024), pp. 244-266. doi: 10.1080/10670564.2023.2195811.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[30] H. M. Wallach, D. Mimno, and A. McCallum. “Rethinking LDA: Why Priors Matter”. In: Advances in Neural Information Processing Systems. Vancouver, Canada, 2009, pp. 1-9.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[31] M. Wevers, J. Kostkan, and K. L. Nielbo. “Event Flow - How Events Shaped the Flow of the News, 1950-1995”. In: Proceedings of the Third Conference on Computational Humanities Research, CHR 2021. Amsterdam, the Netherlands, 2021, pp. 62-76. doi: 10.48550/arXiv.2109.08589.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[32] Q. Zhao, Z. Qin, and T. Wan. “Topic Modeling of Chinese Language Using Character-Word Relations”. In: Neural Information Processing. Berlin, Heidelberg, 2011, pp. 139-147.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>Figure 9: The distributions over time for five topics with high pseudo-probabilities during Xi Jinping’s European tour. These topics are generated by the 10-topic KeyNMF models for Oushinet and Xinouzhou.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>