Using Word Embeddings for Immigrant and Refugee Stereotype Quantification in a Diachronic and Multilingual Setting

Danielly Sorato¹

¹ Research and Expertise Centre for Survey Methodology, Universitat Pompeu Fabra, Barcelona, Spain

Abstract

Languages are complex and systematic instruments of communication that reflect the culture of a given population. Among the many phenomena that can be observed by studying language are social biases, such as stereotypes. The use of stereotypical framing in discourse can be very detrimental, especially when used by the media and politicians, who are often responsible for distortions regarding the outgroup’s (e.g., immigrants, refugees) image inside the country. Such distortions can foster fear and encourage hate-motivated attitudes, leading to problematic outcomes. This paper describes our framework to quantify stereotypical associations concerning immigrants and refugees in public discourse, using a multilingual and diachronic setting. We present our research design and methodology concerning an experiment with a multilingual corpus of parliament texts, for the period of 1996 to 2018.

Keywords

Word embeddings, Diachronic analysis, Multilingual analysis, Computational sociolinguistics

1. Introduction

A stereotype is a type of social bias that is present when discourse about a given group overlooks the diversity of its members and focuses only on a small set of features [1, 2], which can be observed by studying language. However, like society, languages are not static. By analyzing language over time, it is possible to gain insights into the dynamics of social, cultural, and political phenomena reflected in texts [3], such as negative stereotypes of immigrant groups.

Alongside the growing levels of immigration inflows experienced in European countries in recent decades, the increasing negative framing of immigrants and refugees in public discourse has become a major concern [4, 5, 6, 7, 8, 9].
The media, politicians, and other key social actors are often responsible for distortions regarding the ingroup’s perceptions of and attitudes towards outgroups inside their countries [10, 11, 5, 12]. Such distortions can foster fear and encourage anti-immigration attitudes, leading to problematic outcomes. Misperceptions concerning immigrant populations are an especially timely and relevant issue, having played a major role in important political events, such as Brexit, and in the increased support for extreme right-wing political parties and rising nationalism in Europe [11, 13, 14, 15].

Doctoral Symposium on Natural Language Processing from the PLN.net network 2022 (RED2018-102418-T), 21-23 September 2022, A Coruña, Spain.
danielly.sorato@upf.edu (D. Sorato)
ORCID: 0000-0002-4691-7231 (D. Sorato)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Nonetheless, manually analyzing texts spanning several years of public discourse is unfeasible due to the large amount of data involved. As such, computational methods for diachronic linguistic analysis play a crucial role, and ongoing research shows that word embedding models are helpful tools to this end, since they contain machine-learned biases in their geometry that closely depict social stereotypes [16, 17, 18, 5, 19]. Although such models should be carefully tested for biases and not blindly applied to downstream computational applications due to ethically concerning outcomes [20, 21, 22], they can be a valuable tool for sociolinguistic analysis on large volumes of textual data.

In past work conducted in this PhD, we analyzed the dynamics of stereotypical associations towards seven of the most prominent ethnic groups living in Spain (British, Colombian, Ecuadorian, German, Italian, Moroccan, and Romanian) in the period of 2007 to 2018 using word embedding models trained with news items from the Spanish newspaper 20 Minutos [23]. We investigated biases concerning concepts related to crimes, drugs, poverty, and prostitution, exploring the relation between the stereotypical associations and sociopolitical variables (e.g., GDP per capita (PPP) of the groups’ countries of origin, unemployment rates). The interpretation of main effects and interactions with sociopolitical predictors in our multilevel modelling approach indicated that the texts exhibit stereotypical associations, especially for the Colombian, Ecuadorian, Moroccan, and Romanian groups.

In our ongoing research, we extend our study to a multilingual setting and a different domain: political discourse. Our goal is to quantify and compare the strength of stereotypical associations towards immigrants and refugees in the period of 1996 to 2018 concerning concepts such as crimes, poverty, and trafficking in the discourse of British, Danish, Dutch, and Spanish parliaments. Moreover, in addition to analyzing the stereotypes through the geometries of vector spaces, we aim to examine the effects of sociopolitical variables (e.g., immigration inflows, criminality rates) on the stereotypical association time-series using a Bayesian multilevel modelling approach. Finally, we aim to understand the different ways in which bias manifests itself in the vector spaces of different types of embeddings, such as static versus contextual embeddings, and word versus sentence embeddings.

This paper is organized as follows. In Section 2 we discuss related works. Subsequently, in Section 3 we state our research questions, and in Section 4 we present our metrics, data, model training, and evaluation.
Finally, in Section 5 we present our proposed discussion points.

2. Related Work

Human-generated data is full of both intentional and non-intentional stereotypes. However, certain types of stereotypes pose special difficulties, since they can be subtle and often do not rely on personality traits (e.g., honest, empathetic), as is the case for stereotypes about immigrants [1]. In this context, word embeddings have proven to be a valuable tool, enabling efficient methods for analyzing and quantifying linguistic and social phenomena in natural language.

Overall, most works concerning the study of machine-learned biases target English or address gender bias exclusively [18, 24, 17, 16, 25, 26, 27, 20]. Nonetheless, biases can exist in all human languages, and in many shapes and forms, which calls for research using other target languages and biases.

Wevers [28] quantified gender biases in 40 years of news published in six Dutch newspapers. Tripodi et al. [12] investigated antisemitism in public discourse in France by using diachronic word embeddings trained on a large corpus of French books and periodicals containing keywords related to Jews. Sánchez-Junquera et al. [1] detected stereotypes towards immigrants in political discourse by focusing on the frames used by political actors. They created their own taxonomy to capture immigrant stereotype dimensions and produced an annotated dataset with sentences that Spanish politicians have stated in the Congress of Deputies, which was then used to train classifiers to detect stereotypes. Kroon et al. [5] quantified the dynamics of stereotypical associations concerning several outgroups in 11 years of Dutch news data, focusing on how such associations differ according to group membership (ingroup vs. outgroups). Lauscher et al. [19]
conducted an analysis of racism- and sexism-related biases in Arabic word embeddings across different types of embedding models and texts (e.g., user-generated content, news), dialects, and time.

The literature concerning bias detection in multilingual settings is still scarce and recent, as such scenarios pose greater challenges than monolingual ones. Câmara et al. [29] quantified gender, racial, ethnic, and intersectional social biases across five models trained on sentiment analysis tasks in English, Spanish, and Arabic. Ahn and Oh [30] verified the existence of ethnic biases in monolingual BERT models for English, German, Spanish, Korean, Turkish, and Chinese, while proposing a new multi-class bias measure to quantify the degree of ethnic bias in such language models. Further, they proposed two bias mitigation methods using multilingual and word alignment approaches. Névéol et al. [31] contributed to the analysis of multilingual stereotypes by creating an English and French dataset¹ that enables comparison across these languages, while also characterizing biases that are specific to each country (United States and France) and language. Their dataset includes bias types such as ethnic, gender, sexual orientation, nationality, and age, among others. This dataset was then used to verify stereotypes in three French language models and one multilingual language model.

Our study distinguishes itself from the aforementioned studies by (i) the interdisciplinarity with social survey research, as the selected survey questions measure attitudes of the ingroup towards immigrants and can be interpreted as a proxy for cultural/economic threat perception; (ii) our choice of multilevel modelling to combine types of phenomena (linguistic and social) and account for group effects; and (iii) the use of fine-grained lists to investigate stereotypical portrayals (e.g., concepts related to poverty, drugs, human trafficking, prostitution).
Additionally, we contribute to the scarce literature on stereotypical bias analysis with non-English data sources (Danish, Dutch, and Spanish) and multilingual settings.

¹ https://gitlab.inria.fr/french-crows-pairs/acl-2022-paper-data-and-code

3. Research Questions

For the multilingual setting of this research, our main objective is to quantify and compare the strength of association between immigrants and refugees and stereotypical concepts (e.g., crimes, unemployment, poverty) in the discourse of Danish, Dutch, British, and Spanish parliaments across time (1996-2018). We intend to analyze the vector spaces of different embedding techniques, e.g., static versus contextual embeddings, and word versus sentence embeddings. Finally, we aim to examine the effect of sociopolitical indicators that are relevant to the context of attitudes towards immigrants on our trends, with the objective of verifying whether demographic trends correlate with our reported stereotypical associations. We pursue these objectives by seeking answers to the following research questions:

• Can we track stereotypes about immigrants and refugees in political data across time using different embedding techniques?
• How can we systematically compare biases in the vector spaces of different embedding techniques?
• Can we compare and find patterns in the stereotypical association time-series for different languages?
• Can we inspect the effect of country-specific sociopolitical variables (e.g., immigration inflows, public opinion measured by survey, criminality rates) on the computed time-series?

4. Methodology

This thesis revolves around the study of the dynamics of the stereotypical associations concerning outgroups in European public discourse (e.g., news, political speech) over time using embedding models.
In previous work, we studied the stereotypical associations towards British, Colombian, Ecuadorian, German, Italian, Moroccan, and Romanian nationalities using static embeddings (fastText implementation [32]) trained on the news domain, considering the years 2007 up to 2018 in our analysis.

In our current setup, we adopt a multilingual perspective and a different domain and time span: parliamentary speeches covering the years 1996 up to 2018. In addition, we analyze stereotypical associations towards immigrants and refugees, rather than specific nationalities. In order to compute the association trends over time in this new setting, we start by training static language-specific skip-gram embedding models using our target corpora. To answer our research questions, we adopt the following data, metrics, and models.

4.1. Data

For our monolingual case study, we compiled the Corpus of Spanish news 20 Minutos [33]. The corpus contains news articles written in Spanish from Spain that were web-scraped from the website of the newspaper 20 Minutos². This dataset was split by year, allowing us to train 12 yearly word embedding models.

To train embedding models in our multilingual setup, we combine the Danish, Dutch, English, and Spanish portions of the following parliamentary corpora:

• Europarl [34] (release 7);
• Parlspeech V2 [35];
• ParlaMint [36];
• IM-PRESS/PRESS, Written Question, Written Question Answer, Oral Question, and Questions for Question Time portions³ of the Digital Corpus of the European Parliament (DCEP) [37].

² https://www.20minutos.es/

As in the monolingual study, we split our final language-specific datasets by year and then train the embedding models (4 languages × 23 years = 92 models).

4.1.1. Sociopolitical data

The sociopolitical variables for our monolingual study were taken from the Instituto Nacional de Estadística (INE)⁴ and the European Social Survey (ESS)⁵.
As indicators, we used the size of the foreign population residing in Spain by nationality, the rate of the population receiving unemployment social benefits, public opinion about immigration measured with survey questions from the ESS, and the number of committed offenses.

In our ongoing research, we will use country-specific sociopolitical indicators from Eurostat⁶ (e.g., immigration influx, criminality rates, population by citizenship and labour status) and questions from the ESS. Additionally, we are studying the feasibility of including measurements of outgroup integration and of acceptance of immigrant and asylum policies.

4.1.2. Defining Multilingual lists

It is crucial to ensure that concept lists are balanced across languages and closely depict our intended domain. Our initial word list is based on the multilingual European Migration Network (EMN) glossary of asylum and migration terms⁷. This glossary contains approximately 500 terms and concepts reflecting the most recent European policy on migration and asylum. Then, we consulted with native speakers and a migration studies specialist to expand the initial subset selected from the EMN glossary and to identify other concepts of interest, e.g., human trafficking. Finally, we queried our datasets and models to verify the frequency of such words, excluding those with low frequency and adding missing words that the models indicated as similar. The lists were then revised again by the domain specialist.

During the aforementioned process, we verified that our group and concept vector representations had low variance across the years and languages, as a way to ensure that our findings could not be attributed to instabilities in our vector representations.

4.2. Models

As in our Spanish case study, using the datasets filtered by year, we trained skip-gram embedding models using the fastText implementation.
Only words that appeared at least 10 times in each yearly dataset were taken into account in the training phase, and after training the resulting word vectors were $L_2$ normalized. We evaluate the quality of our models using generic word similarity benchmarks originally in English and then extended to other languages, such as the RG-65 and the MC-30 benchmarks.

³ Details about the corpus portions are available on https://joint-research-centre.ec.europa.eu/language-technology-resources/dcep-digital-corpus-european-parliament_en
⁴ “National Institute of Statistics”, https://www.ine.es/
⁵ https://www.europeansocialsurvey.org/
⁶ https://ec.europa.eu/eurostat
⁷ https://ec.europa.eu/home-affairs/networks/european-migration-network-emn/emn-asylum-and-migration-glossary_en

To test the effect of sociopolitical variables on our time-series, we adopt a multilevel modelling approach. A multilevel model is an extension of a regression in which data is structured in groups and coefficients can vary by group [38]. Concerning the inspection of patterns in the computed stereotype time-series, Autoregressive Integrated Moving Average (ARIMA) models could be applied.

4.3. Metrics

Distributional semantic models maintain the properties of vector spaces and adopt the hypothesis that the meaning of a word is conveyed in its co-occurrences. Therefore, in order to measure the similarity between two given words represented by the vectors $v_1$ and $v_2$, we can apply the cosine similarity between the $L_2$-normalized vectors.

In our published study, to quantify social stereotypes in the trained word embedding models, we used the bias score, as defined by Garg et al. [18], since it has been externally validated by the authors through correlations with census data. The bias score captures the strength of the association of a given set of words $S$ with respect to two groups $v_1$ and $v_2$, as shown in Equation 1.
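
As a minimal sketch of this computation, the snippet below sums, over the word vectors in the set, the difference between each vector's cosine similarity to the two group vectors. The function names (`cosine`, `bias_score`) and the two-dimensional toy vectors are assumptions made purely for illustration; in the study, the vectors would come from the trained embedding models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bias_score(concept_vectors, group_one, group_two):
    """Sum over the concept vectors of cosine(v_s, group_one) - cosine(v_s, group_two)."""
    return sum(
        cosine(v_s, group_one) - cosine(v_s, group_two)
        for v_s in concept_vectors
    )


# Hypothetical toy vectors (not taken from any trained model).
group_one = [1.0, 0.0]
group_two = [0.0, 1.0]
concepts = [[1.0, 0.0], [0.6, 0.8]]

score = bias_score(concepts, group_one, group_two)
print(round(score, 6))
```

With these toy vectors the score is positive, because the concept vectors lie, on balance, closer to the first group vector than to the second; swapping the two group vectors flips the sign.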
The more negative the bias score, the more strongly $S$ is associated with group two; the more positive, the more strongly $S$ is associated with group one.

\[
\text{bias score} = \sum_{v_s \in S} \left[ \mathrm{cosine}(v_s, v_1) - \mathrm{cosine}(v_s, v_2) \right] \tag{1}
\]

To test for biases in the sentence and contextualized embeddings, we start our investigation by using sentence templates and principal component analysis (PCA) [39, 40].

5. Discussion

Aiming to improve our work by discussing it with the Natural Language Processing community, we bring the following research elements for deliberation in this doctoral symposium:

1. The creation of automatic or semiautomatic procedures for extracting and balancing word lists that represent concepts across languages;
2. Validating the quality of embeddings trained in a specific domain, i.e., parliamentary speeches;
3. Applying the multilingual setup to contextualized embedding models (e.g., BERT, RoBERTa).

The first point refers to the time-consuming and iterative process of creating the multilingual word lists and then balancing them across languages. Although we believe that the lists should be revised by domain specialists, automatic procedures for extracting initial word lists and verifying meaning equivalence across languages would greatly reduce the time spent in this step. Exploring the use of external resources with semantic information as an automatic method for creating the lists could facilitate the process.

The second point concerns the verification of embedding quality with an approach that allows us to see whether the embeddings correctly represent the intended domain, in this case parliamentary speeches, rather than using generic word similarity benchmarks.

Lastly, we wish to apply our framework to masked-language contextualized embedding models such as BERT. To that end, we would like to discuss model architectures suitable for smaller datasets, or the use of pre-trained models such as the Spanish RoBERTa model [41].

References

[1] J.
Sánchez-Junquera, B. Chulvi, P. Rosso, S. P. Ponzetto, How do you speak about immigrants? Taxonomy and StereoImmigrants dataset for identifying stereotypes about immigrants, Applied Sciences 11 (2021) 3610.
[2] H. Tajfel, A. A. Sheikh, R. C. Gardner, Content of stereotypes and the inference of similarity between members of stereotyped groups, Acta Psychologica (1964).
[3] A. Marakasova, J. Neidhardt, Short-term semantic shifts and their relation to frequency change, in: Proceedings of the Probability and Meaning Conference (PaM 2020), 2020, pp. 146–153.
[4] M. J. Creighton, P. Schmidt, D. Zavala-Rojas, Race, wealth and the masking of opposition to immigrants in the Netherlands, International Migration 57 (2019) 245–263.
[5] A. C. Kroon, D. Trilling, T. Raats, Guilty by association: Using word embeddings to measure ethnic stereotypes in news coverage, Journalism & Mass Communication Quarterly (2020) 1077699020932304.
[6] P. M. Sniderman, L. Hagendoorn, M. Prior, Predisposing factors and situational triggers: Exclusionary reactions to immigrant minorities, American Political Science Review (2004) 35–49.
[7] P. Sniderman, L. Hagendoorn, Multiculturalism and its discontents in the Netherlands: When ways of life collide, 2007.
[8] G. Lahav, et al., Immigration and politics in the new Europe: Reinventing borders, Cambridge University Press, 2004.
[9] L. McLaren, H. Boomgaarden, R. Vliegenthart, News coverage and public concern about immigration in Britain, International Journal of Public Opinion Research 30 (2018) 173–193.
[10] R. Zapata-Barrero, Perceptions and realities of Moroccan immigration flows and Spanish policies, Journal of Immigrant & Refugee Studies 6 (2008) 382–396.
[11] A. Gorodzeisky, M. Semyonov, Perceptions and misperceptions: actual size, perceived size and opposition to immigration in European societies, Journal of Ethnic and Migration Studies 46 (2020) 612–630.
[12] R. Tripodi, M. Warglien, S. L. Sullam, D.
Paci, Tracing antisemitic language through diachronic embedding projections: France 1789-1914, in: Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, 2019, pp. 115–125.
[13] D. Herda, Too many immigrants? Examining alternative forms of immigrant population innumeracy, Sociological Perspectives 56 (2013) 213–240.
[14] Y. Pottie-Sherman, R. Wilkes, Does size really matter? On the relationship between immigrant group size and anti-immigrant prejudice, International Migration Review 51 (2017) 218–250.
[15] E. Schlueter, P. Scheepers, The relationship between outgroup size and anti-outgroup attitudes: A theoretical synthesis and empirical test of group threat- and intergroup contact theory, Social Science Research 39 (2010) 285–295.
[16] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, in: Advances in Neural Information Processing Systems, 2016, pp. 4349–4357.
[17] H. Gonen, Y. Goldberg, Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 609–614.
[18] N. Garg, L. Schiebinger, D. Jurafsky, J. Zou, Word embeddings quantify 100 years of gender and ethnic stereotypes, Proceedings of the National Academy of Sciences 115 (2018) E3635–E3644.
[19] A. Lauscher, R. Takieddin, S. P. Ponzetto, G. Glavaš, AraWEAT: Multidimensional analysis of biases in Arabic word embeddings, in: Proceedings of the Fifth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 192–199.
[20] O. Papakyriakopoulos, S. Hegelich, J. C. M. Serrano, F.
Marco, Bias in word embeddings, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 446–457.
[21] J. Brandon, Using unethical data to build a more ethical world, AI and Ethics 1 (2021) 101–108.
[22] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610–623.
[23] D. Sorato, D. Zavala-Rojas, M. d. C. C. Ventura, Using word embeddings to quantify ethnic stereotypes in 12 years of Spanish news, in: Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association, 2021, pp. 34–46.
[24] A. C. Kozlowski, M. Taddy, J. A. Evans, The geometry of culture: Analyzing the meanings of class through word embeddings, American Sociological Review 84 (2019) 905–949.
[25] K. Kurita, N. Vyas, A. Pareek, A. W. Black, Y. Tsvetkov, Measuring bias in contextualized word representations, in: Proceedings of the First Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics, Florence, Italy, 2019, pp. 166–172. doi:10.18653/v1/W19-3823.
[26] T. Manzini, L. Yao Chong, A. W. Black, Y. Tsvetkov, Black is to criminal as Caucasian is to police: Detecting and removing multiclass bias in word embeddings, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 615–621. doi:10.18653/v1/N19-1062.
[27] M.-E. Brunet, C. Alkalay-Houlihan, A. Anderson, R. Zemel, Understanding the origins of bias in word embeddings, in: International Conference on Machine Learning, PMLR, 2019, pp. 803–811.
[28] M.
Wevers, Using word embeddings to examine gender bias in Dutch newspapers, 1950-1990, in: Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, 2019, pp. 92–97.
[29] A. Câmara, N. Taneja, T. Azad, E. Allaway, R. Zemel, Mapping the multilingual margins: Intersectional biases of sentiment analysis systems in English, Spanish, and Arabic, in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 90–106.
[30] J. Ahn, A. Oh, Mitigating language-dependent ethnic bias in BERT, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 533–549. doi:10.18653/v1/2021.emnlp-main.42.
[31] A. Névéol, Y. Dupont, J. Bezançon, K. Fort, French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English, in: ACL 2022-60th Annual Meeting of the Association for Computational Linguistics, 2022.
[32] P. Bojanowski, É. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[33] P. Razgovorov, D. Tomás, et al., Creación de un corpus de noticias de gran tamaño en español para el análisis diacrónico y diatópico del uso del lenguaje, Comité Editorial 62 (2019) 29–36.
[34] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, 2005, pp. 79–86.
[35] C. Rauh, J. Schwalbach, The ParlSpeech V2 data set: Full-text corpora of 6.3 million parliamentary speeches in the key legislative chambers of nine representative democracies (2020).
[36] T. Erjavec, M. Ogrodniczuk, P. Osenova, N. Ljubešić, K. Simov, A. Pančur, M. Rudolf, M. Kopp, S. Barkarson, S.
Steingrímsson, et al., The ParlaMint corpora of parliamentary proceedings, Language Resources and Evaluation (2022) 1–34.
[37] N. Hajlaoui, D. Kolovratnik, J. Väyrynen, R. Steinberger, D. Varga, DCEP-Digital Corpus of the European Parliament, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014.
[38] A. Gelman, J. Hill, Data analysis using regression and multilevel/hierarchical models, Cambridge University Press, 2006.
[39] R. A. Chávez Mulsa, G. Spanakis, Evaluating bias in Dutch word embeddings, in: Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 56–71.
[40] K. Kurita, N. Vyas, A. Pareek, A. W. Black, Y. Tsvetkov, Measuring bias in contextualized word representations, in: Proceedings of the First Workshop on Gender Bias in Natural Language Processing, 2019, pp. 166–172.
[41] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). doi:10.26342/2022-68-3.