A Comparison of Text Representation Methods for Predicting Political Views of Social Media Users

Anna Glazkova[0000-0001-8409-6457]
University of Tyumen, 6, Volodarskogo Str., Tyumen, 625003, Russian Federation
a.v.glazkova@utmn.ru

Abstract. The paper focuses on the task of predicting the political views of social media users. The aim of this study is to identify the most effective method for representing textual information from user profiles. We compared several text representation methods, including bag-of-words modeling, averaged word2vec embeddings, Sentence Transformers representations, and text representations obtained with three BERT-based models: Multilingual BERT, SlavicBERT, and RuBERT. We conducted our experiments on a dataset of VKontakte users' data collected with the VK API. We evaluated the effectiveness of binary classification between the pages of users with radical political views, including ultraconservatives, communists, and libertarians, and users who are indifferent to politics. Further, we compared the impact of various text representations for distinguishing users belonging to different radical political movements: communists vs. libertarians, libertarians vs. ultraconservatives, and ultraconservatives vs. communists. As expected, the best results were shown by BERT-based models; however, the best result in each task was achieved by a different model.

Keywords: Social Media, Political Preferences, Text Representation, BERT, VKontakte, Word Embeddings.

1 Introduction

Social media analysis is one of the key tasks of natural language processing and information retrieval. Its aim is to develop powerful methods and algorithms that extract relevant information from a large volume of social network data [6]. Social media data processing is a significant part of various natural language processing systems, such as social media monitoring and fact checking [8; 13], interest discovery [27], health care [12; 21], business intelligence [31], and security [29; 33].

The widespread use of social networks for social and political communication creates many opportunities to monitor the political views of large numbers of people in real time [4]. Thus, social network profiles contain a lot of valuable information that can be used as material for sociological research or as a tool for political influence [9].

In this paper, we compare text representation methods for predicting the political views of social media users based on textual data posted on their personal profiles. We consider the following types of text representations: a) a bag-of-words representation [11]; b) averaged word2vec embeddings [22]; c) text representations from BERT [5], including embeddings from Multilingual BERT, RuBERT [16], and SlavicBERT [2], as well as text representations obtained with Sentence Transformers [26]. We conduct our experiments on a dataset collected using the VKontakte API (https://vk.com/dev/first_guide) and use textual information from the personal profiles of social media users to predict their political preferences.

The paper is organized as follows. Section 2 gives a brief description of the text representation methods used in this work. Section 3 describes the dataset used for this study. Section 4 presents our experiments and results. Section 5 is a conclusion, and Section 6 contains acknowledgements.

2 Text Representation Methods

In this section, we describe the methods of text representation used in our work.

Bag of words. The bag-of-words (BoW) model [11] is a classical approach to text representation for machine learning algorithms. The model describes the occurrence of words within a document: the text is represented as a matrix of token counts. The BoW model is widely used in many natural language processing tasks. Nevertheless, it suffers from some shortcomings, such as sparsity and ignoring word order.

Word2vec. The idea of word2vec [22] is based on the assumption that the meaning of a word is affected by the words around it. This statement follows the distributional hypothesis [11]. Word embeddings assign a real-valued vector to each word and represent the word by that vector. Averaged word embeddings are calculated by summing the vectors of all words in a text and dividing the sum by the number of words.
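As an illustration, the following is a minimal sketch of computing averaged word embeddings with Gensim, the library we use in Section 4. The model file name is a placeholder; note also that RusVectores models typically expect lemmatized, POS-tagged tokens (e.g., "музыка_NOUN"), which we omit here for brevity.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; we used a RusVectores word2vec model (see Section 4)
wv = KeyedVectors.load_word2vec_format("ruwiki_word2vec.bin", binary=True)

def average_embedding(tokens, wv):
    """Sum the vectors of all in-vocabulary tokens, divide by their count."""
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        # No known words in the text: fall back to the zero vector
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

text_vector = average_embedding("любимая музыка и книги".split(), wv)
```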
BERT (Bidirectional Encoder Representations from Transformers). This machine learning technique [5], based on the transformer neural architecture, achieves state-of-the-art results in a wide variety of natural language processing tasks, including text classification, opinion mining, and others. BERT's key innovation is bidirectional training, which takes both the left and right context of a token into account. In our work, we evaluated the following models:

─ Multilingual BERT [5], a model pretrained by Google on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
─ RuBERT [16], a model by DeepPavlov [3] trained on the Russian part of Wikipedia and news data. Multilingual BERT was used as an initialization for RuBERT.
─ SlavicBERT [2], a model trained on Russian news and four Wikipedias: Bulgarian, Czech, Polish, and Russian. Multilingual BERT was also used as an initialization for SlavicBERT.
─ Text representations obtained with Sentence Transformers [26], a framework that provides a method to compute dense vector representations for sentences and paragraphs based on BERT-based networks. In our experiments, we used a multilingual knowledge-distilled version of the multilingual universal sentence encoder [32] trained for the task of similar text detection; it supports more than 50 languages.
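For reference, the sketch below shows one common way to obtain a fixed-size text representation from a BERT encoder with the Hugging Face Transformers library (which we use in Section 4). The checkpoint name and pooling strategy are illustrative; in our experiments, the BERT-based models were additionally fine-tuned on the training data.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any of the encoders above can be plugged in here; the Multilingual
# BERT checkpoint is used as an example
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

text = "Описание интересов пользователя"  # "description of user interests"
inputs = tokenizer(text, truncation=True, max_length=128,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One common fixed-size representation: the final hidden state of the
# [CLS] token (mean pooling over all tokens is a frequent alternative)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```

With Sentence Transformers, a pooled sentence or paragraph vector is returned directly by model.encode(texts).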
3 Dataset

To conduct our experiments, we collected data using the VKontakte API. The personal profile of a VKontakte user contains a list of text and categorical fields. Users can fill in these fields with their personal information. In particular, they can indicate their political views by choosing one of nine possible options: moderate, conservative, liberal, socialist, monarchist, ultraconservative, libertarian, communist, and indifferent.

The aim of this study is to compare the impact of different text representations for predicting the political views of social network users. For example purposes, we decided to evaluate the effectiveness of binary classification between users who indicated radical political views, such as ultraconservatives, communists, and libertarians, and users who are indifferent to politics. Moreover, we compared the impact of text representations for distinguishing users belonging to different radical political movements.

For this purpose, we downloaded textual information from users' personal profiles. This information is contained in text fields that users fill in in free form. These fields include descriptions of the user's activities, favorite music, movies, TV shows, games, sources of inspiration, and the user's worldview.
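A minimal sketch of such a download with the VK API users.get method is shown below. The access token and user id are placeholders, and the exact field set is illustrative; in the VK API, the political-views code and fields such as sources of inspiration are carried in the "personal" object.

```python
import requests

API_URL = "https://api.vk.com/method/users.get"

params = {
    "user_ids": "1",                     # placeholder user id
    "fields": "activities,interests,music,movies,tv,books,"
              "games,about,quotes,personal",
    "access_token": "YOUR_ACCESS_TOKEN",  # placeholder token
    "v": "5.126",
}

profile = requests.get(API_URL, params=params).json()["response"][0]

# Concatenate the free-form text fields into a single document
text_fields = ["activities", "interests", "music", "movies",
               "tv", "books", "games", "about", "quotes"]
text = " ".join(profile.get(f, "") for f in text_fields)

# personal["political"] is an integer code (communist, libertarian,
# ultraconservative, indifferent, ...)
label = profile.get("personal", {}).get("political")
```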
Text fields are mostly filled in Russian, since VKontakte is especially popular in post-Soviet countries. However, there are a large number of texts in other languages, for example, English and various Slavic languages [1; 7; 15]. We combined the text from all text fields of the profile and selected only those users whose texts have a total length of at least 10 words. Further, for our experiments, we selected users who indicated ultraconservative, communist, libertarian, or indifferent political views. Table 1 shows the main characteristics of our data.

Table 1. Corpus description.

Political views     Number of texts   Avg length (words)   Avg length (symbols)
Communist           299               31.1                 260.06
Ultraconservative   240               27.66                222.91
Libertarian         116               46.51                392.5
Indifferent         799               38.63                316.09

4 Results and Discussion

We conducted our experiments on Google Colab Pro (https://colab.research.google.com/; CPU: Intel(R) Xeon(R) CPU @ 2.20GHz; RAM: 25.51 GB; GPU: Tesla P100-PCIE-16GB with CUDA 10.1). Each BERT-based model (Multilingual BERT (mBERT), SlavicBERT, and RuBERT) was fine-tuned on the training set for 2 epochs. We fine-tuned the pretrained language models with different random seeds and tried combining them with other parameter settings. The models were optimized using AdamW [20] with a learning rate of 2e-5, an epsilon of 1e-8, a maximum sequence length of 128 tokens, and a batch size of 32. We implemented our models using the PyTorch [23] and Hugging Face Transformers [30] libraries.

We utilized word2vec embeddings provided by RusVectores (https://rusvectores.org/en/) [17]. The model was trained on Russian Wikipedia (https://ru.wikipedia.org/) texts collected in 2018; the vector size is 300. The model was loaded and processed with Gensim [25]. Finally, we implemented the BoW representation using the Scikit-learn Python library [24].

To reduce the class imbalance, we used a random oversampling technique implemented with the Imbalanced-learn Python library [18]. Random oversampling supplements the training data with multiple copies of some minority-class examples and can be a fast and effective solution to the problem of class imbalance [10; 28].

We applied a Linear Support Vector Machine (LinearSVC) as a classifier with a tolerance for the stopping criterion equal to 1e-5. The classifier takes the various text representations as input sequentially. We used the weighted F1-score as the evaluation metric. The classifier was implemented with Scikit-learn [24]. We split our data into training and test sets in an 80:20 ratio and performed 3-fold cross-validation. To preprocess the data, we used Pymorphy2 [14] and NLTK [19].
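A condensed sketch of this evaluation step is given below, assuming the feature matrices and labels have already been prepared; the names X_train, y_train, X_test, and y_test are hypothetical and stand for one train/test split of a given text representation.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

def evaluate_representation(X_train, y_train, X_test, y_test):
    # Balance classes by duplicating minority-class training examples
    X_res, y_res = RandomOverSampler(random_state=42).fit_resample(
        X_train, y_train)

    clf = LinearSVC(tol=1e-5)  # tolerance for the stopping criterion
    clf.fit(X_res, y_res)

    # Weighted F1 accounts for the label imbalance in the test set
    return f1_score(y_test, clf.predict(X_test), average="weighted")
```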
The results are presented in Table 2.

Table 2. Results (weighted F1-score, %).

Text representation     Indifferent   Communists vs.   Libertarians vs.     Ultraconservatives   Avg
                        vs. others    libertarians     ultraconservatives   vs. communists       F1-score
BoW                     56.74         65.91            78.65                52.04                63.34
Word2vec                63.61         64.42            70.05                53.19                62.82
Sentence Transformers   67.43         64.17            80.97                51.68                66.06
mBERT                   65.41         65.97            79.01                52.23                65.66
RuBERT                  66.12         64.8             78.61                57.18                66.68
SlavicBERT              67.95         65.27            77.05                57.17                66.86

As can be seen from the table above, the best results in all tasks were obtained using BERT-based models. SlavicBERT achieved an F1-score of 67.95% on the indifferent vs. others task. The classifier trained on mBERT embeddings showed 65.97% on the communists vs. libertarians task. For the libertarians vs. ultraconservatives task, the best result was shown by Sentence Transformers embeddings (80.97%). The highest result for the ultraconservatives vs. communists task was achieved by RuBERT (57.18%). The best averaged result was shown by SlavicBERT (66.86%).

It can be seen from the data in Table 2 that the results for the ultraconservatives vs. communists task are lower than those for the other tasks. At the same time, all classifiers show their best results on the libertarians vs. ultraconservatives task. This fact can be useful when studying the interests of social groups with different political views.

5 Conclusion

In this study, we compared several methods of representing textual data from users' profiles on social networks. The best results were obtained with BERT-based models, which currently show state-of-the-art results in many natural language processing tasks. In our future work, we plan to explore various ways of representing different types of features for predicting political views in social media.

6 Acknowledgments

The reported study was funded by RFBR and EISR, project number 20-011-32031.

References

1. Anisimova, O., Vasylenko, V., Fedushko, S.: Social networks as a tool for a higher education institution image creation. arXiv preprint arXiv:1909.01678 (2019).
2. Arkhipov, M. et al.: Tuning multilingual transformers for language-specific named entity recognition. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 89-93 (2019).
3. Burtsev, M. et al.: DeepPavlov: Open-source library for dialogue systems. In: Proceedings of ACL 2018, System Demonstrations, 122-127 (2018).
4. Conover, M. et al.: Predicting the political alignment of Twitter users. In: PASSAT/SocialCom 2011, 192-199, IEEE, Boston (2011).
5. Devlin, J. et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
6. Farzindar, A., Inkpen, D.: Natural language processing for social media. Synthesis Lectures on Human Language Technologies, 8(2), 1-166 (2015).
7. Feshchenko, A. V. et al.: Analysis of user profiles in social networks to search for promising entrants. In: INTED2017: 11th International Technology, Education and Development Conference, 5188-5194 (2017).
8. Glazkova, A., Glazkov, M., Trifonov, T.: g2tmn at Constraint@AAAI2021: Exploiting CT-BERT and ensembling learning for COVID-19 fake news detection. arXiv preprint arXiv:2012.11967 (2020).
9. Glazkova, A., Sokova, Z., Kruzhinov, V.: Predicting political views in social media: VKontakte as a case study. https://osf.io/preprints/27ku6/, last accessed 2020/12/22.
10. Glazkova, A.: A comparison of synthetic oversampling methods for multi-class text classification. arXiv preprint arXiv:2008.04636 (2020).
11. Harris, Z. S.: Distributional structure. Word 10(2-3), 146-162 (1954).
12. Jiang, L., Yang, C. C.: User recommendation in healthcare social media by assessing user similarity in heterogeneous network. Artificial Intelligence in Medicine 81, 63-77 (2017).
13. Kim, J. H. et al.: Understanding social media monitoring and online rumors. Journal of Computer Information Systems, 1-13 (2020).
14. Korobov, M.: Morphological analyzer and generator for Russian and Ukrainian languages. In: International Conference on Analysis of Images, Social Networks and Texts, 320-332 (2015).
15. Krylova, I. et al.: Languages of Russia: Using social networks to collect texts. In: Russian Summer School in Information Retrieval, 179-185 (2015).
16. Kuratov, Y., Arkhipov, M.: Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213 (2019).
17. Kutuzov, A., Kuzmenko, E.: WebVectors: a toolkit for building web interfaces for vector semantic models. In: International Conference on Analysis of Images, Social Networks and Texts, LNCS, 155-161, Springer, Cham (2016).
18. Lemaître, G., Nogueira, F., Aridas, C. K.: Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559-563 (2017).
19. Loper, E., Bird, S.: NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028 (2002).
20. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017).
21. Lu, Y. et al.: Understanding health care social media use from different stakeholder perspectives: a content analysis of an online health community. Journal of Medical Internet Research, 19(4), e109 (2017).
22. Mikolov, T. et al.: Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119 (2013).
23. Paszke, A. et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 8026-8037 (2019).
24. Pedregosa, F. et al.: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830 (2011).
25. Řehůřek, R., Sojka, P.: Gensim: statistical semantics in Python. Retrieved from gensim.org (2011).
26. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3973-3983 (2019).
27. Shahzad, B. et al.: Discovery and classification of user interests on social media. Information Discovery and Delivery (2017).
28. Suh, Y. et al.: A comparison of oversampling methods on imbalanced topic classification of Korean news articles. Journal of Cognitive Science, 18(4), 391-437 (2017).
29. Walsh, J. P.: Social media and border security: Twitter use by migration policing agencies. Policing and Society, 30(10), 1138-1156 (2020).
30. Wolf, T. et al.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38-45 (2020).
31. Xu, X. et al.: Business intelligence in online customer textual reviews: Understanding consumer perceptions and influential factors. International Journal of Information Management, 37(6), 673-683 (2017).
32. Yang, Y. et al.: Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307 (2019).
33. Zhang, Z., Gupta, B. B.: Social media security and trustworthiness: overview and new direction. Future Generation Computer Systems, 86, 914-925 (2018).