Machine Learning Methods for Indicating Cultural Biases in Spoken Russian Language: Dominants and Trends of Modern Society

Anna Chizhik (a), Yulia Zherebtsova (b), Alexander Sadokhin (c)

(a) Saint Petersburg State University, 10th line of Vasilievsky Island, 49, St. Petersburg, 199178, Russia
(b) Inforser Engineering, Ryazansky Prospekt, 24, building 2, 109428, Moscow, Russia
(c) Russian State Social University, Stromynka str., 18/28, 107076, Moscow, Russia

IMS 2021 - International Conference "Internet and Modern Society", June 24-26, 2021, St. Petersburg, Russia
EMAIL: a.chizhik@spbu.ru (A. 1); julia.zherebtsova@gmail.com (A. 2); sadalpetr@yandex.ru (A. 3)
ORCID: 0000-0002-4523-5167 (A. 1); 0000-0003-4450-2566 (A. 2); 0000-0002-6420-6601 (A. 3)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
The formation of a socio-cultural layer is based on a person's understanding of himself and the surrounding world, and on the translation of this understanding into abstraction. This inevitably leads to the emergence of cultural biases in society as an extreme form of separation of one social group from another. In essence, a bias is a nonrandom error in thinking. Growing cultural biases are preserved in the consciousness of individuals and affect how a neighboring social group is perceived and interpreted, and therefore the public mood, in other words, the level of aggressiveness of society. The problem of identifying such cultural shifts is thus relevant for the scientific community. Many existing methods are based on surveys and their subsequent analysis. In this paper, we propose to use machine learning and the analysis of a large collection of text data from social networks (a public Telegram chat). This approach can complement the standard methodology, in particular by revealing hidden patterns thanks to its ability to cover large amounts of data.

Keywords
machine learning, natural language processing, text clustering, cultural biases, text analysis, cultural code, cultural process

1. Introduction

The view of the world is formed by the immersion of an individual in the media space, which includes all possible communication channels («one to one», «one to many», «many to many»). At the current moment, social networks accumulate the potential for key influence on the consciousness of individuals due to their dynamic structure. A complex of stable associations, opinions and stereotypes around complex social phenomena is formed on their basis. Even where social networks are not the primary sources of information, people turn to this channel of mass communication to learn others' opinions of the news and to compare their own opinion with the opinions of others. The non-representativeness of information increases within horizontal communicative structures. Thus, social media texts not only build the information agenda, but radically affect social mood and public opinion by spontaneously distorting the picture of objective reality. Social media texts therefore form a subjective and biased point of view on events, creating a certain image of reality in the mind of the information consumer and ultimately influencing the course of events. The peculiarity of communication in social media lies in the specific flow of interpersonal perception processes. People find uncertainty in interpersonal relationships unpleasant, and are thus motivated to reduce it. Stereotyping and identification by affiliation with particular social groups strongly influence the individual mind in this way and contribute to how a person is judged.
This creates a situation in which social network participants are united by homogenized opinions, acquiring an averaged consciousness. Participants in communication are keen to win the approval of the social group, so the feedback effect (verbal responses, likes, emoticons) provides an important incentive. Social networks are not a platform for spontaneously emerged communities, and they are not isolated from other channels of interaction with information flows. Therefore, sociocultural tendencies penetrate them from the media sphere in its multifaceted understanding (including through traditional media, opinion leaders, propaganda, etc.). The cycle of comments (and replies to them) is repeated depending on the severity of the topic and the activity of the initiators of the conversation thread. Acquiring the features of recursion, this cycle of communication forms public mood and stereotypes. The effectiveness of such communication remains in question, because the existence of a rational conversation vector (constructive dialogue) is quite difficult to trace. Since the incentive to continue communication in this case is the approval of the majority, it is fair to assume that this kind of dialogue strengthens existing cultural and social biases and also creates the basis for the formation of an aggressive field around them.

Subjective and systemic biases of social actors influence the choice of information and the content features of the texts presented in this communicative space. Subjective biases operate at the level of individual information processing in the context of current events. Such subjective biases can arise from shared values, information overload, and cultural preferences. Some subjective prejudices are transformed into systemic ones over time, shaping the consciousness of individuals at the mesoscopic level (mass consciousness). This creates patterns in the mindsets of societies. These regularities cannot be observed at the level of a specific statement by an individual; however, they can be observed by analyzing big data. In this paper we present mathematical methods for detecting these collective spontaneous information filters, which form the cultural and social biases that exist in modern Russian society.

2. Method

The language model is the basis for mathematical methods of text analysis. It makes it possible to calculate the probability that a word will follow a given word sequence as a continuation of the text. Thus, a statistical language model defines a probability distribution over sequences of vocabulary items. One of the first methods of constructing language models was the n-gram model [1, 2]. The probability of a word sequence is computed as the product of word probabilities conditioned on the known previous words; therefore, only a few previous words (n words) matter in this kind of statistical analysis. Later, various architectures based on machine learning algorithms and artificial neural networks became widespread as the basis of language models [3, 4, 5]. Neural network language models are divided into two groups: word-aware NLMs and character-aware NLMs.
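To make the n-gram approach above concrete, here is a minimal illustrative sketch (not the implementation used in this study): a bigram model with add-one smoothing estimated from a tiny hand-made English corpus. All sentences, token names and the smoothing choice are ours.

```python
# A minimal bigram language model sketch: the probability of a sentence is
# approximated as the product of P(w_i | w_{i-1}) estimated from bigram counts.
from collections import Counter

corpus = [
    "freedom of speech is important",
    "freedom of the press is important",
    "the internet gives freedom of speech",
]

tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the known vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence):
    """Probability of a sentence under the bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("freedom of speech"))   # relatively likely word order
print(sentence_prob("speech of freedom"))   # much less likely word order
```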
A good language model must capture two important properties of a natural language. The first is correct syntax: a few previous words are sufficient for a relevant prediction of the next word, but the word order in a sentence becomes an important factor. The second property is coherence: a large number of words is often required in order to understand the global meaning of a sentence or document (while word order matters less). Traditional n-gram models and neural probabilistic language models have difficulties in extracting global semantic information from text because of a fixed-size context window; that is, the polysemy and context-dependent nature of words are not taken into account. Consequently, contextualized language models, which try to take the context of word use into account, are gradually gaining popularity [6, 7, 8, 9]. This approach allows combining the two necessary properties of a natural language (correct syntax and coherence).

The process of training a language model begins with creating a collection of natural-language texts (a dataset). A language model predicts the next word in a text, so it should have seen many examples in order to learn the language. Essentially, the model calculates the probability of a word appearing after a known word sequence, and this prediction is based on examples from the dataset. To represent words inside a language model, it is necessary to map words or phrases from the vocabulary to vectors of real numbers (word embeddings). The semantic similarity of words, or the frequency of their joint occurrence, can then be detected by comparing the distances between these vectors (cosine similarity). A classic example of this method shows the interdependency between the pairs of words «king-man» and «queen-woman». Figure 1 demonstrates that algebraic operations in this vector space correspond to operations on the meaning of words.

Figure 1: Semantic relationships of words in a language model

Similar calculations can be performed for sentences (Fig. 2). This allows calculating the similarity of an entered phrase (its semantic coherence or antonymic nature) to sentences from the dataset that are already understandable to the language model.

Figure 2: Matching sentences based on how the language model works

Accordingly, semantic relationships between words (or sentences) acquire mathematical meaning in a vector space. This suggests that the most productive language models reproduce text sequences that contain the typical biases of social micro- and macro-groups. These biases initially arise as an emotional reaction to different phenomena and antagonistic groups. The detection of these biases by a language model is possible because its training relies on the joint occurrence of words. In other words, a language model "knows" better that «freedom» is used with the word «speech» the more often such an idiom («freedom of speech») is found in the training dataset. Summing up, the biases that are found in datasets and thus fall into language models can, in general, be characterized as the actual social landmarks of society.
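As a hedged illustration of the vector arithmetic behind Figures 1 and 2, the toy example below uses hand-picked three-dimensional vectors (a real model learns hundreds of dimensions from a large corpus) purely to show how cosine similarity and the «king - man + woman ≈ queen» analogy are computed.

```python
# Toy word vectors chosen by hand; only the geometry of the comparison matters.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "king" - "man" + "woman" should land close to "queen"
target = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)                              # expected: 'queen'
print(cosine(emb["king"], emb["queen"]))    # similarity of semantically related words
```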
3. Typology of cultural bias

The following types of cultural biases can be identified: national or religious ideals, social connotations, gender stereotypes and aggressive statements of a general nature. Cultural biases in natural language can be present in a latent form or as a direct manifestation of an attitude towards the object of a statement. The collected text data (on the basis of which the language model works) gravitates towards the socio-demographic and mental stereotypes, traditions and patterns of behavior accepted in society. Therefore, it seems interesting to analyze the social connotations and contexts of cultural biases. In this paper, we provide an analysis of biases in two contexts: those that at the descriptive level reflect social mood [10, 11, 12] on a specific topic, and those that outline the characteristics of a social group. A group here means people with the same markers (gender, nationality, etc.).

4. Experiments and Results

A public Telegram chat was chosen as the data source (the chat of the news channel Mash, 1608 participants, not moderated). The overall text collection consists of 556 354 records, the first dated 2018-07-05 and the last 2020-12-03. Clustering of messages provided the first characteristics of the chat conversations, including information about the content of topics and their close interaction with each other. The cluster analysis of the dataset was performed with the k-means algorithm. The vector matrix was created using a Word2Vec model trained on our dataset in order to obtain a less ambiguous partitioning into cluster groups. Figure 3 shows that the clusters are very close to each other and sometimes even intersect in space.

Figure 3: Cluster ratio

The next step was to identify associative chains of terms. Pairs and triples of words were taken for this purpose. The model found the terms from the dataset that are most semantically close to the chosen items (the theoretical principle behind the model is shown in Fig. 1). The model finds sets of words that are close in meaning (quasi-synonyms), whose meanings can differ in several characteristics (for example, in relation to the speaker) and change depending on the context. The closeness of a word to a term can be interpreted either as equality («she» = «her» = «girl» = «wife» = ...) or as a word strongly associated with the term («she» = «girl» = «wife» => husband). Thus, only those words that are closely interrelated in the cultural code end up among the associations. An example of the resulting associations is shown in Table 1.

Table 1
Semantic associations to the words «liberty», «democracy», «Internet» (each entry is a term with its similarity score)

Liberty:
('recognize', 0.9017236232757568), ('corruption', 0.8983240723609924), ('punishment', 0.8904907703399658), ('citizen', 0.880867063999176), ('crime', 0.8789682388305664), ('revision_year', 0.871111273765564), ('ratio', 0.8698971271514893), ('fairness', 0.7563846111297607), ('death_punishment', 0.7523390054702759), ('monarchy', 0.7488603591918945), ('liberalism', 0.7421742081642151), ('acting_power', 0.7411474585533142), ('civil_society', 0.7408543229103088)

Democracy:
('opposition', 0.9191081523895264), ('communism', 0.9145876169204712), ('equality', 0.9139655828475952), ('develop', 0.9133450984954834), ('putinsky', 0.9130712747573853), ('capitalism', 0.8889749646186829), ('monarchy', 0.8852110505104065), ('socialism', 0.8837734460830688), ('vertical', 0.8762210607528687), ('scrap', 0.8600466251373291), ('dictatorship', 0.8596632480621338), ('mentality', 0.8592602610588074)

Internet:
('anonymity', 0.6254202723503113), ('vpn', 0.6117355227470398), ('doesn't work', 0.6087996959686279), ('free access', 0.5988985300064087), ('telegram', 0.5986903309822083), ('rkn', 0.5934107899665833), ('outage', 0.5892473459243774), ('space', 0.5862609148025513), ('default', 0.5862008929252625)
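Association lists in the style of Table 1 can be obtained with a nearest-neighbour query over Word2Vec embeddings. The sketch below uses gensim (4.x API) on a tiny illustrative corpus; the study trained its model on the full collection of Telegram messages, and the hyperparameters shown are placeholders rather than the exact configuration used.

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized chat messages (the study trained Word2Vec on
# ~556,000 Telegram records); tokens are illustrative English placeholders.
sentences = [
    ["freedom", "of", "speech", "and", "anonymity"],
    ["internet", "freedom", "needs", "vpn", "and", "anonymity"],
    ["democracy", "opposition", "equality", "communism"],
    ["they", "discuss", "democracy", "and", "capitalism"],
] * 50

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word embeddings
    window=5,         # context window size
    min_count=1,      # keep every token in this tiny toy corpus
    seed=1,
    workers=1,
)

# Nearest neighbours of a term by cosine similarity, as reported in Table 1
for word, score in model.wv.most_similar("freedom", topn=5):
    print(f"{word}\t{score:.4f}")
```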
People most often associate the term «democracy» with opposition. The next logical link from this word leads to «communism». Thus, the evidence base for discussions around democracy is the previous political system of our country. Obviously, the understanding of the term is reduced to cultural and social contexts colored by national history. The words «equality» and «develop» frequently appear in conversations on this topic; this can probably be classified as hope for a brighter future. It is noteworthy that the adjective «putinsky» (characterizing something as belonging to the period of Putin's presidential term) appears among the seven most closely related terms. It strongly links the discussion of democracy with the current agenda, because this word clearly indicates a non-abstract line of reasoning.

We also investigated the reflection of the agenda through collocations. A collocation is a phrase that has the signs of a syntactically and semantically integral unit (a stable phrase). Highlighting collocations can help delineate the social and political tendencies of social macro-groups. For example, throughout the entire data collection (more than 500 thousand messages), «Russia» most often occurs with the word «president»; such phrases as «Russians forward» and «Putin is the president» proved to be stable collocations. An example of the identified collocations is given in Table 2.

Table 2
Collocation examples

Liberty:
'freedom of speech', 'deprivation of freedom', 'restriction freedom', 'release from prison', 'right to freedom', 'internet freedom'

The Internet:
'bad_internet', 'sovereign_internet', 'internet_freedom', 'free_internet', 'internet_access', 'internet_passport'

The same terms as in the previous example were taken for a clear interpretation of the results. The identification of stable collocations complements the ability of language models to assess social and cultural landscapes. Whereas in the example above the term «liberty» was associated exclusively with criminal liability and offenses, the search for collocations revealed topics that most likely interest people in connection with the term «democracy». However, there are no stable phrases with the term «democracy» in this kind of text analysis, whereas the previous step of the study did identify such tendencies. At the same time, detecting the bounds of the term «Internet» in this way gives good results and could be used for a deeper understanding of the public mood.
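One possible way to obtain Table-2-style underscore-joined collocations is gensim's Phrases model, which merges frequently co-occurring token pairs into single tokens. The sketch below is illustrative only: the corpus, tokens and threshold are toy stand-ins, and the study's own collocation procedure may differ.

```python
from gensim.models.phrases import Phrases, Phraser

# Toy stand-in for the tokenized Telegram messages (the study's corpus has
# ~556,000 records); the tokens are illustrative English placeholders.
sentences = [
    ["deputies", "passed", "sovereign", "internet", "law"],
    ["sovereign", "internet", "bill", "was", "discussed"],
    ["providers", "prepare", "for", "sovereign", "internet"],
    ["people", "love", "internet", "memes"],
    ["freedom", "of", "speech", "matters"],
] * 20

# Phrases scores adjacent token pairs by their co-occurrence statistics and
# joins stable pairs with an underscore, yielding tokens in the style of
# Table 2 (e.g. 'sovereign_internet'). The threshold is set low because the
# toy corpus is tiny; gensim's default is 10.0.
phrases = Phrases(sentences, min_count=5, threshold=0.1)
bigram = Phraser(phrases)

print(bigram[["sovereign", "internet", "shutdown"]])
# expected output: ['sovereign_internet', 'shutdown']
```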
Multidimensional space visualizations of word vectors can help in interpreting the relationships between word embeddings. Figure 4 presents the visualization of the results described above for the associative bounds of the words «liberty», «democracy», «Internet». The method of nonlinear dimensionality reduction t-SNE [13] was used for this purpose. The basic principle of t-SNE is to reduce the pairwise distances between points while maintaining their relative positions. The algorithm constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. This makes it possible to map high-dimensional data to a low-dimensional space while the neighborhood structure of the points is preserved.

Figure 4: Semantic associations to the words «liberty», «democracy», «Internet»

Information about the semantic correlations of the texts has been obtained by visualizing the grouped data. The figure demonstrates that vector representations of words capture the semantic relations between such categories as the political system of the country and gender stereotypes; even the associations of swear words with discussions of dissatisfaction with one phenomenon or another are clearly visible. This kind of analysis provides direct information about cultural biases based on a mathematical apparatus, that is, with a certain accuracy and impartiality. Developing this strategy of text analysis further, the most frequent words (the top four by frequency) were taken from each cluster detected at the first step of the research (Fig. 5).

Figure 5: Sociocultural analysis of the identified clusters

The presence of sufficiently clearly grouped semantic clouds shows the tendencies of public opinion. In particular, the totality of the formed clusters and their content reveal tendencies present in society related to the separation of male and female roles, as well as attitudes towards national minorities; at the same time, the topics around COVID-19 are clearly visible (such as emergency medical services, general news about the virus, discussion of hospitals).

In terms of mathematical and computational linguistics, a bias implies a shift from the selected item to the left or to the right along the coordinate axes of the space. By way of illustration, the example of the bias between the words «dictatorship» and «democracy» is shown in Figure 6. The X-axis (horizontal) is set from «democracy» to «dictatorship»: words close in meaning to democracy (within the studied texts) are on the left, and words similar to the word «dictatorship» are on the right. It was decided to use «citizen» as the anchor word.

Figure 6: An example of the social and cultural biases detected in the data collection from chat dialogs (the X-axis is stretched between the words «democracy» and «dictatorship», point (0,0) is «citizen»)

It turned out that «citizen» is a very good choice of anchor word, because its position (in the middle of the projection, at the zero mark of the X-axis) indicates the neutrality of the term within the context of our research. The words that are attracted to the pole of the word «democracy» (to the left of the zero mark on the X-axis) are outlined in green for convenience; words that are semantically connected with «dictatorship» (right) are purple. Accordingly, the closer a word is to the left edge of the X-axis, the more clearly it illustrates public attitudes toward «democracy». The Y-axis displays the spread of words in their ideological differences and helps to detect two semantic groups that characterize this phenomenon in the public consciousness.
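The projection behind Figure 6 can be sketched as follows: the X-axis is the direction from the «democracy» vector towards the «dictatorship» vector, the anchor word «citizen» is placed at the zero mark, and every other word receives a signed coordinate along this axis. The example below uses hand-made two-dimensional vectors purely to show the geometry; in the study the vectors would come from the Word2Vec model trained on the chat corpus.

```python
import numpy as np

# Hand-made toy vectors standing in for the learned embeddings.
emb = {
    "democracy":    np.array([-1.0,  0.2]),
    "dictatorship": np.array([ 1.0, -0.2]),
    "citizen":      np.array([ 0.0,  0.0]),
    "activist":     np.array([-0.7,  0.5]),
    "censorship":   np.array([ 0.8,  0.4]),
}

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

# X-axis of Figure 6: direction from «democracy» towards «dictatorship»
axis = unit(emb["dictatorship"] - emb["democracy"])
origin = emb["citizen"]  # anchor word, placed at the zero mark

def bias_coordinate(word):
    """Signed projection onto the democracy-dictatorship axis:
    negative values lean towards «democracy», positive towards «dictatorship»."""
    return float(np.dot(emb[word] - origin, axis))

for w in ["activist", "citizen", "censorship"]:
    print(f"{w}\t{bias_coordinate(w):+.3f}")
# activist < 0 (democracy pole), citizen ~ 0, censorship > 0 (dictatorship pole)
```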
On the one hand, an opinion clearly emerges in the mass consciousness that democracy is established with the active participation of society in this process (all the words that fall into a significant sample, such as «person», «manifesto», «activist», are easily summarized under the category «civil society»). On the other hand, the words «art», «scientist», «peace» and «freedom» can also be highlighted as markers of the general idea of what democracy provides. Words that are «drawn» to the right side (to «dictatorship») are «censorship», «church», «raise the retirement age», «taxes». An attitude towards the term is thus clearly expressed. This arrangement of items in space also makes it convenient to trace cultural biases of meaning between synonymous words, for example «police» - «cop» and «reform» - «law». If «police» is an attribute of a democratic society, then «cop» in the Russian language refers to a dictatorship; «reform» is associated in public perception with democracy, while phrases with the words «law» and «bad law» appear on the dictatorship side.

5. Conclusion

Examples of aggressive cultural biases, although they have been identified, are deliberately not shown in the results. The purpose of this paper is to describe the potential of the method without a deep analysis of specific stereotypes and public opinion. Our experiments demonstrate that the method is applicable to interpreting the cultural biases that are formed in public consciousness through the active use of social media. It is important to consider a number of factors before using statistical language models, such as the definition of the subject and functional boundaries of the object under study, the nature of the object under study and the possible linguocultural consequences of its use, and the detection of the role of the object in language (natural language as a repository of the cultural code), which determines its place in the system of linguocultural universals. An important application of this method can be the identification of aggression in social groups through text data.

6. References

[1] Jelinek, F. Computation of the probability of initial substring generation by stochastic context-free grammar. Computational Linguistics, Vol. 17, No. 3, 315–323 (1991).
[2] Stolcke, A., Segal, J. Precise n-gram probabilities from stochastic context-free grammars. Proceedings of the 32nd Annual Meeting of the ACL, 74–79 (1994).
[3] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013). https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf, last accessed 2021/03/19.
[4] Joulin, A., Grave, E., Bojanowski, P., Mikolov, T. Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 427–431 (2017).
[5] Pennington, J., Socher, R., Manning, C. D. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 1532–1543 (2014).
[6] Che, W., Liu, Y., Wang, Y., Zheng, B., Liu, T. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 55–64 (2018).
[7] Peters, M. E., Neumann, M., Iyyer, M. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237 (2018).
[8] Artetxe, M., Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, Vol. 7, 597–610 (2019).
[9] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report (2019). https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf, last accessed 2021/03/19.
[10] Yadov, V. A. Ideology as a form of spiritual activity of society. In Russian (1961).
[11] Porshnev, B. F. Social psychology and history. In Russian (1979).
[12] Uznadze, D. N. The psychology of set. In Russian (2001).
[13] Maaten, L. van der, Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605 (2008).