=Paper= {{Paper |id=Vol-3683/CEUR-Template-2col6 |storemode=property |title=Mapping Global Protest Tendencies: Geolocating Trends and Topics Through Wikipedia Analysis |pdfUrl=https://ceur-ws.org/Vol-3683/paper10.pdf |volume=Vol-3683 |authors=Jiyun Beak,Ludovic Moncla |dblpUrl=https://dblp.org/rec/conf/ecir/BeakM24 }} ==Mapping Global Protest Tendencies: Geolocating Trends and Topics Through Wikipedia Analysis== https://ceur-ws.org/Vol-3683/paper10.pdf
                                Mapping Global Protest Tendencies: Geolocating
                                Trends and Topics Through Wikipedia Analysis
                                Jiyun Beak1,2,* , Ludovic Moncla2
                                1
                                    Korea Advanced Institute of Science and Technology
                                2
                                    INSA Lyon, CNRS, UCBL, LIRIS, UMR 5205, F-69621


                                              Abstract
                                              This study investigates the diverse manifestations of ’protest’ across cultures and regions, aiming to
                                              provide a nuanced understanding of global dynamics and their impact on human rights. Utilizing topic
                                              modeling methods, we extract a substantial corpus of documents from the English Wikipedia, employing
                                              precise clustering techniques to categorize various types of protests based on semantic elements such
                                              as race, gender, and language. Through cartographic visualization, we illustrate the frequency and
                                              distribution of different protest topics. The primary goal is to identify geographic hotspots of human
                                              rights conflict, offering a detailed analysis of regional differences in protest propensity. This research
                                              serves as an initial step towards a comprehensive global understanding of protest dynamics and their
                                              implications for human rights worldwide.

                                              Keywords
                                              Geographical Topic Modelling, Zero-shot topic modeling, Semi-supervised topic modeling




                                1. Introduction
                                Throughout history, protests have sparked numerous revolutions around the world or at least
                                marked a historic moment for countries. Recent examples include the Black Lives Matter
                                movement in the United States [1], the Candlelight protests in South Korea [2] or the yellow
                                vest protest in France [3]. However, the nature of protests can vary greatly depending on the
                                cultural context and some of the most lasting impacts of social movements are not only in the
                                political realm, but also in everyday life [4].
                                   This article applies topic modeling and geographic mapping to the content of Wikipedia
                                articles on social movements in order to analyze protest activity in different countries1.
                                Topic modeling is commonly applied to open data sources such as social media and Wikipedia. It
                                helps to discover themes that recur in texts and to understand the evolution of topics in social
                                media data. Social movements and protests are actively discussed online and thus offer a large
                                amount of interesting data for topic modeling methods. For example, Latent Dirichlet Allocation
                                (LDA) topic modeling has been used to conduct a comparative study of the #BlackLivesMatter
                                and #StopAsianHate movements [5], which were actively pursued on social media. Topic

                                GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24,
                                2024, Glasgow, Scotland
                                *
                                  Corresponding author.
                                Email: bamjy99@naver.com (J. Beak); ludovic.moncla@insa-lyon.fr (L. Moncla)
                                ORCID: 0000-0002-1590-9546 (L. Moncla)
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                1
                                    Dataset and code are available on GitHub: https://github.com/ludovicmoncla/mapping-global-protest-tendencies




modeling combined with geographic information science has been used in several studies, for
example to map tweets [6] or to track online discussions geographically over time [7].
   Our methodology aims to highlight how often and where different protest topics occur based
on Wikipedia articles. Our main objective is to pinpoint geographic areas where human rights
conflicts are most intense. This study seeks to provide a first step towards the understanding of
the dynamics of protest movements and their influence on human rights globally.


2. Methodology
2.1. Topic Modeling
This study explores the application of topic modeling to identify themes related to protests and
human rights in Wikipedia articles. Simultaneously, it clusters articles within these identified
topics. For this purpose, we experimented with the BERTopic2 framework [8], a deep learning-based
topic modeling approach. BERTopic aims to overcome the limitations of traditional topic modeling
methods such as LDA and NMF. Unlike bag-of-words models, which ignore semantic relationships,
BERTopic chains embeddings (default: BERT Sentence Transformers), dimensionality reduction
(UMAP), clustering (HDBSCAN), a tokenizer, and a weighting scheme (c-TF-IDF) in sequence
to create coherent topics. This approach allows for more meaningful topic representations.
BERTopic variations include semi-supervised, multimodal, hierarchical, and dynamic topic modeling,
with additional models such as KeyBERT or GPT for enhanced topic fine-tuning. OpenAI’s Topic Modeling
utilizes GPT-3.5 to improve clustering by generating synthetic text samples as topic labels.
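
For reference, the class-based TF-IDF (c-TF-IDF) weighting mentioned in this section is defined in the BERTopic paper [8] as shown below. The symbols are those of that paper and are not used elsewhere in this article.

```latex
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{\mathrm{tf}_{t}}\right)
```

Here $\mathrm{tf}_{t,c}$ is the frequency of term $t$ in class (cluster) $c$, $\mathrm{tf}_{t}$ is the frequency of term $t$ across all classes, and $A$ is the average number of words per class. Terms that are frequent in one cluster but rare overall receive the highest weights and become that cluster's topic words.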
   This study explores ’Zero Shot Topic Modeling’, a semi-supervised method that identifies
predefined topics in texts with selected labels and uncovers new topics when documents diverge
from these labels. It flexibly handles different outcomes, including the identification of both
zero-shot and clustered topics, solely zero-shot topics, or none at all. Zero-shot topics are
pinpointed through cosine similarity with predefined labels, and a combined BERTopic model
integrates both zero-shot and traditional topics.
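
The assignment step described above can be illustrated with a minimal sketch. It is not the BERTopic implementation: the function names, the two-dimensional toy embeddings and the threshold value are assumptions chosen for illustration. A document is attached to the predefined label with the highest cosine similarity if that similarity reaches the threshold. Otherwise it is left over for regular clustering.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_zero_shot(doc_vecs, label_vecs, min_similarity):
    """Split documents into zero-shot assignments and leftovers.

    Returns (assigned, leftover): `assigned` maps a document index to its
    best-matching label, `leftover` lists the indices whose best similarity
    is below `min_similarity` and that would go to standard clustering.
    """
    assigned, leftover = {}, []
    for i, doc in enumerate(doc_vecs):
        best_label, best_sim = max(
            ((label, cosine(doc, vec)) for label, vec in label_vecs.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= min_similarity:
            assigned[i] = best_label
        else:
            leftover.append(i)
    return assigned, leftover

# Toy 2-D embeddings (hypothetical values, for illustration only).
labels = {"gender": [1.0, 0.0], "religion": [0.0, 1.0]}
docs = [[0.9, 0.1], [0.1, 0.95], [0.7, 0.7]]
assigned, leftover = assign_zero_shot(docs, labels, min_similarity=0.8)
```

In this toy example the third document is equally close to both labels (similarity of about 0.71), so it falls below the threshold and is left for clustering. Raising the threshold sends more documents to the clustering step, which is why higher thresholds tend to produce more new topics.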

2.2. Geographical Mapping
The second major step in our research pipeline involves the process of geographically mapping
the identified topics within Wikipedia articles related to different forms of protest. Our first
experiment is conducted at the country level, with the goal of gaining a comprehensive under-
standing of the distribution of protest-related topics across different regions of the world. We
systematically extract all instances of country names and their corresponding adjectival forms
from the articles. This comprehensive compilation serves as the basis for our georeferencing
efforts. We then delve into the analysis of the frequency with which each country name is
associated with specific topics. This granular examination allows us to discern the prevalence
and distribution patterns of protest-related discourse within the context of individual countries,
facilitating a nuanced exploration of how these themes manifest and resonate across different
geopolitical landscapes.
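
A minimal sketch of this counting step is given below. The lookup table is a deliberately tiny, hypothetical gazetteer, and the function name is an assumption. The actual list of country names and adjectival forms used in the study is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer: lowercase surface form -> country.
COUNTRY_FORMS = {
    "france": "France", "french": "France",
    "israel": "Israel", "israeli": "Israel",
    "south korea": "South Korea", "south korean": "South Korea",
}

# Longer forms are listed first so they are tried before their prefixes;
# \b anchors prevent matches inside longer words.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(form) for form in sorted(COUNTRY_FORMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def count_countries(text):
    """Count mentions of each country, merging names and adjectival forms."""
    return Counter(
        COUNTRY_FORMS[match.group(1).lower()] for match in _PATTERN.finditer(text)
    )

sample = (
    "French unions and Israeli activists marched; France saw the largest "
    "South Korean solidarity rally."
)
counts = count_countries(sample)
```

Merging each adjectival form into its country name means "French" and "France" contribute to the same count. This simple dictionary approach does not resolve ambiguous forms or alternative spellings.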

2
    https://maartengr.github.io/BERTopic/
3. Experiments
Our experiment started with the gathering of nearly 10,000 Wikipedia entries featuring the
term "protest" in their title or content. Subsequently, we employed the Wikipedia API Python
wrapper3 to retrieve the textual content of these entries. Then, a language-dependent
preprocessing phase involving traditional natural language processing techniques (i.e., tokenization,
lemmatization, and the removal of stopwords) was performed in order to prepare the dataset
for topic modeling.
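
The three preprocessing operations can be sketched as follows. This is only an illustration: the stopword list and the lemma lookup are tiny hypothetical stand-ins for the resources an NLP library would provide, and the function name is an assumption.

```python
import re

# Hypothetical, minimal resources for illustration only.
STOPWORDS = {"the", "in", "of", "and", "a", "were"}
LEMMAS = {
    "protests": "protest",
    "protesters": "protester",
    "marched": "march",
    "cities": "city",
}

def preprocess(text):
    """Tokenize, drop stopwords, and map each remaining token to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenization
    kept = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [LEMMAS.get(t, t) for t in kept]           # lemmatization by lookup

tokens = preprocess("The protesters marched in the cities of France.")
# tokens == ["protester", "march", "city", "france"]
```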
   In an initial experiment, we used the zero-shot BERTopic method to compare predefined topic
embeddings with document embeddings via cosine similarity. Based on a threshold, documents
were either assigned to these zero-shot topics or clustered by the standard BERTopic model.
This approach helps identify both expected and unexpected topic clusters. For the analysis
of predefined protest tendencies we define seven labels: gender, nationality, ethnicity, race,
language, religion and disability. The minimum topic size was set to 50 documents in order
to limit the number of new topics, and the General Text Embeddings (GTE) model [9] was used. Several
values for the minimum cosine similarity were tested. Results show that, for 0.7, no new
cluster is found, while 30 and 40 new clusters are identified for 0.8 and 0.85, respectively. Having
40 new clusters reduces the size of the predefined topics; in this case, the nationality topic does not
even exist. Among the 30 new topics found at 0.8, 20 seem to refer directly to locations (mainly
countries) rather than themes, such as [knesset palestine gaza israel], [syria syrian damascus assad],
and [ukraine russia protest crimea]. Other topics, related to forms of protest, activism or
specific political themes rather than human rights, are also of interest, such as [song performed
music album], [statue monument erected plaza], or [nuclear protest antinuclear opposition].
However, zero-shot learning showed its limits, as some of the new clusters should have been
grouped with predefined ones, such as [apartheid opposition protest africa]. Additionally, because
of the minimum topic size limit, some articles are considered outliers and are not clustered, such
as “School Strike for Climate”4.
   For the geocoding part of our process, we counted the number of occurrences of each country
name (and its corresponding adjectival forms) within each document and topic. Working at the country
level is a first step that reduces the ambiguity of extracting place names with global coverage.
We used the GeoPandas5 and Matplotlib6 Python libraries to visualize the frequency of each
country name per topic (see Figure 1). Frequency mapping shows that topics have different
worldwide distributions. The results show that the USA is among the five most frequent country
names for all seven predefined topics, while Israel, India, Iran, the United Kingdom and Russia
are in the top five for three topics. France is also in the top five for nationality and language, Ireland
for nationality and disability, Canada for disability and language, China for language, and South
Africa and Australia for race. The most frequently mentioned countries are known
for a diversity rooted in their historical background. For example, Israel is known for being a
multi-ethnic, multilingual country [10, 11]. Moreover, ethnic and national identities such as
"American" often conflict with the diverse realities within countries, complicating state-building
3
  https://github.com/martin-majlis/Wikipedia-API
4
  https://en.wikipedia.org/wiki/School_Strike_for_Climate
5
  https://geopandas.org/en/stable/
6
  https://matplotlib.org
and reviving ethnic movements with the historical forces of nationalism [12]. Furthermore, in
the contemporary world, inequalities at the intersections of global capitalism, race, and class
are intensifying [13], suggesting that one protest can escalate others in similar contexts.
   Enhancements are needed for accurately recognizing multi-word country names and various
spellings in texts. Additionally, a detailed analysis is required to determine the context and
relevance of the country names to the topics discussed.
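
The per-topic aggregation behind the top-five lists reported in this section can be sketched as follows. The function name and the toy inputs are assumptions for illustration. The counts are made-up values, not results from the study.

```python
from collections import Counter, defaultdict

def top_countries_per_topic(doc_topics, doc_counts, k=5):
    """Sum per-document country counts by topic and keep the k most frequent.

    `doc_topics[i]` is the topic of document i and `doc_counts[i]` maps
    country names to their number of mentions in that document.
    """
    per_topic = defaultdict(Counter)
    for topic, counts in zip(doc_topics, doc_counts):
        per_topic[topic].update(counts)  # adds counts, does not overwrite
    return {
        topic: [country for country, _ in counter.most_common(k)]
        for topic, counter in per_topic.items()
    }

# Toy input (hypothetical values).
doc_topics = ["language", "language", "race"]
doc_counts = [
    {"Canada": 3, "France": 1},
    {"France": 4},
    {"South Africa": 2},
]
ranking = top_countries_per_topic(doc_topics, doc_counts, k=2)
```

The resulting per-topic totals are the kind of values that can then be joined to country geometries for choropleth mapping.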




Figure 1: Country name frequency per topic




4. Conclusion
In conclusion, this preliminary study shows promising results in the use of topic modeling
and geographical mapping to analyze protest activity across different countries in order to
provide a nuanced understanding of the dynamics of social movements and their impact on
human rights globally. By employing innovative methodologies such as BERTopic and zero-shot
topic modeling, this research provides insights into the prevalence and distribution patterns
of protest-related discourse, shedding light on the varying cultural contexts and historical
influences shaping these movements. Further investigation into the role of specific countries
within each protest topic is essential for a deeper understanding of their societal and historical
implications. Through continued research and analysis, we can strive towards fostering greater
awareness and advocacy for human rights issues worldwide.
References
 [1] M. M. Francis, Can Black Lives Matter within U.S. democracy?, The ANNALS of the
     American Academy of Political and Social Science 699 (2022) 186–199. doi:10.1177/
     00027162221078340.
 [2] A. L. Park, A Recycling of the Past or the Pathway to the New? Framing the South Korean
     Candlelight Protest Movement, Journal of Asian Studies 81 (2022) 101–105. doi:10.1017/
     S0021911821001480.
 [3] S. Kipfer, What colour is your vest? reflections on the yellow vest movement in france, Stud-
     ies in Political Economy 100 (2019) 209–231. doi:10.1080/07078552.2019.1682780.
 [4] A. Morris, C. M. Mueller, Frontiers in social movement theory, Yale University Press, 1992.
 [5] X. Tong, Y. Li, J. Li, R. Bei, L. Zhang, What are people talking about in #blacklivesmatter
     and #stopasianhate? exploring and categorizing twitter topics emerged in online social
     movements through the latent dirichlet allocation model, in: Proceedings of the 2022
     AAAI/ACM Conference on AI, Ethics, and Society (AIES ’22), Oxford, United Kingdom,
     2022, p. 723–738. doi:10.1145/3514094.3534202.
 [6] D. Ghosh, R. Guha, What are we ‘tweeting’ about obesity? mapping tweets with topic
     modeling and geographic information system, Cartography and geographic information
     science 40 (2013) 90–102. doi:10.1080/15230406.2013.776210.
 [7] M. G. Lozano, J. Schreiber, J. Brynielsson, Tracking geographical locations using a geo-
     aware topic model for analyzing social media data, Decision Support Systems 99 (2017)
     18–29. doi:10.1016/j.dss.2017.05.006.
 [8] M. Grootendorst, Bertopic: Neural topic modeling with a class-based tf-idf procedure,
     arXiv preprint arXiv:2203.05794 (2022). doi:10.48550/arXiv.2203.05794.
 [9] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings
     with multi-stage contrastive learning, 2023. arXiv:2308.03281.
[10] E. Ben-Rafael, E. Shohamy, M. Hasan Amara, N. Trumper-Hecht, Linguistic landscape
     as symbolic construction of the public space: The case of israel, International journal of
     multilingualism 3 (2006) 7–30.
[11] B. Spolsky, R. L. Cooper, The languages of Jerusalem, Oxford University Press, 1991.
[12] S. Olzak, Ethnic protest in core and periphery states, Ethnic and racial studies 21 (1998)
     187–217.
[13] S. Nazneen, A. Okech, Introduction: Feminist protests and politics in a world in crisis,
     2021.