Analysis of Big Five Personality Traits by Processing of
               Social Media Users Activity Features
                  © Maxim Stankevich                             © Ivan Smirnov
  Institute for Systems Analysis, Federal Research Center “Computer Science and Control” of the
                           Russian Academy of Sciences, Moscow, Russia
                                 RUDN University, Moscow, Russia
                     stankevich@isa.ru                              ivs@isa.ru
                                          © Nikolay Ignatiev
                                 RUDN University, Moscow, Russia
                                       naignatiev@yandex.com
                                          © Oleg Grigoriev
  Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences,
                                           Moscow, Russia
                                      oleggpolikvart@yandex.ru
                                        © Natalia Kiselnikova
             Psychological Institute of Russian Academy of Education, Moscow, Russia
                                         nv.pirao@gmail.com

           Abstract. This study analyzes the relation between a user's Big Five personality traits and
     their activity on Vkontakte, a popular Russian social network. To obtain Big Five personality trait
     scores, we asked Vkontakte users to complete a psychological survey and then analyzed data from their
     public personal pages. The purpose of the study was to investigate the relation between social media
     activity features and users' levels of neuroticism, conscientiousness, extraversion, openness to
     experience and agreeableness. To perform the task, we used machine learning classification algorithms.
           Keywords: social media analysis, big five personality traits, machine learning, classification.


1 Introduction

The Big Five personality traits model is a popular psychological tool that is commonly used to
describe human personality along five dimensions: neuroticism, conscientiousness, extraversion,
openness to experience and agreeableness [1]. Personality trait scores are usually calculated from
questionnaires. The widespread use of social media makes it possible to obtain information about
users by analyzing data retrieved from their public pages. However, only a few studies analyze
users' Big Five personality traits using activity information from Russian-speaking social
networks. A number of researchers work on Big Five personality trait prediction and analysis for
English-speaking social networks [2,3], but there are no in-depth studies for Russian. The proposed
approach and the collected dataset are therefore new for Russian social network analysis. The
purpose of the study was to investigate the relation between social media activity features and
users' levels of neuroticism, conscientiousness, extraversion, openness to experience and
agreeableness by using machine learning algorithms.
    To form the dataset, we asked volunteers to complete the NEO-FFI questionnaire [4] and then to
provide access to the information on their public pages under privacy constraints. In this way we
received data for 165 users of the popular Russian social network Vkontakte. We represented the
five personality trait scores on a three-level scale: low, medium, and high. The idea is to treat
the problem as multiclass classification. The classification features are based on a 1-year period
of user activity, represented by posts on users' public pages, and on general profile information
such as gender and the total number of friends and followers. To evaluate the methods, we ran two
sets of experiments with different classifiers: support vector machine and random forest.
    The main issue we faced was a lack of training examples. Although insufficient data prevented
us from significantly improving classification performance, we concluded that the feature format
should be redesigned and that text analysis-based features should be added. We are continuing data
collection and look forward to improving our results in the near future.
Proceedings of the XX International Conference
“Data Analytics and Management in Data Intensive
Domains” (DAMDID/RCDL’2018), Moscow, Russia,
October 9-12, 2018


2 Related works

Many studies investigate the use of social media data for classification in various
psychology-related tasks.
    Besides Big Five personality analysis, the detection of depression, post-traumatic stress
disorder and anxiety is also a very important problem. For example, the organizers of the CLPsych
2015 Shared Task built a dataset consisting of message collections of depressed and non-depressed
users and asked contributors to share the performance of their depression detection models [5].
This shared task, as well as other similar studies such as [6] and [7], applied natural language
processing methods to textual data to form features for a predictive model. The authors of [8] and
[9] used social media activity features to improve classification performance. One should take into
account that depression detection is time-dependent: time constraints must be considered during
dataset preparation. Big Five personality traits, by contrast, are more stable over time [10].
    One of the most significant studies related to social media language and Big Five personality
traits is presented in [3]. The authors analyzed 700 million words, phrases, and topic instances
collected from the Facebook messages of 75,000 volunteers who took a standard personality test. The
work demonstrates some important dependencies between language use and users' personality
attributes. For each personality trait, the authors formed a list of related words that showed
notable correlations with the levels of neuroticism, conscientiousness, extraversion, openness to
experience and agreeableness.
    The work presented in [11] describes Big Five personality trait prediction models for Twitter
users. The dataset contains the most recent 2000 tweets of 279 volunteers. To perform the task, the
authors represented personality trait scores as values on a normalized 0-1 scale. The features were
based on text analysis. The authors used the Linguistic Inquiry and Word Count tool [12] to produce
statistics on 81 different features. The MRC Psycholinguistic Database [13] was used to retrieve
features from users' vocabulary. The authors also used social media activity features. Correlation
analysis revealed that some of them correlated with the five-factor personality model. The proposed
models showed a mean absolute error of about 15% on the normalized scale for each personality
trait, which the authors used as the measure of prediction accuracy.
    Another work related to Big Five trait prediction is described in [14]. The data for that
research contain information about the likes of 58,466 volunteers from Facebook. The authors used a
decomposed User-Likes matrix with logistic and linear regression models to predict users' Big Five
personality traits and other personal attributes. The model achieved high prediction performance
for personal attributes such as gender, age, and nationality (~80% Area Under Curve) and an
accuracy of about 35% for users' Big Five personality traits.
    The research presented in [15] is similar in idea to our work. The authors collected users'
data from Vkontakte and performed correlation analysis using social media activity indicators. The
main interest was in photos published on users' public pages. According to the results, the most
significant correlations were found between extraversion and activity indicators such as the number
of friends and followers, the total number of posts, and some photo-based indicators. The
neuroticism score also showed a notable positive correlation with a user's total number of posts.
    Having analyzed the related works, we came to the following conclusion. The background studies
propose valuable methodologies for Big Five personality trait analysis and prediction, but they
mainly concern the language use of English-speaking social media users. For Russian-speaking social
networks this problem is not well studied. For example, the myPersonality project (see footnote 6)
started in 2007 to gather social media data and psychological questionnaire results from Facebook
users. The large volume of data gathered by this project has been successfully used in various
academic studies. However, there are no available and appropriate datasets based on
Russian-speaking social media. This is the main reason why we had to form our own original dataset
for the task of Big Five personality trait analysis of Vkontakte users.

6 mypersonality.org

3 Dataset

To build the dataset, we asked volunteers from Vkontakte to take part in a psychological survey and
complete the NEO-FFI questionnaire. We then requested access to their public pages under privacy
constraints. Finally, for those who gave their consent and completed the questionnaire, we
collected all available information from their public profile pages. Overall, data from 165
profiles were assembled. Personal information that could reveal the identity of a person was
removed from the data.
    We divided the collected data into two categories: general information about users and
information about user messages posted during the time period from January 2017. The first part
contains such features as the number of friends, number of followers, gender, number of followed
groups and communities, etc. The second part contains the text of the users' messages, timestamps,
and the numbers of likes, comments, and reposts (an analog of a retweet on Twitter).
    It is worth mentioning that we continue to expand our dataset with new examples. This study is
based on
the current amount of available data, but we consider this number only as an intermediate stage.

4 Methods

4.1 Big Five personality traits
Here we describe the methodology for Big Five personality trait prediction. We also describe the
representation of personality scores and the features that we extract from the available data.
    As a first step, we divide the initial NEO-FFI score scale (0-48) of each personality trait as
follows: low level (0-20), medium level (21-32) and high level (33-48) [16]. As a result, one of
these three classes was assigned to each user's score for neuroticism, conscientiousness,
extraversion, openness to experience and agreeableness. Thus, the initial task is transformed into
a multiclass classification task. It should be noted that such an approach imposes some
restrictions on the evaluation method. Figure 1 shows the class distribution of users' extraversion
levels.

Table 1 Big Five personality traits label distribution among users in the dataset.

           Big Five trait                    Low, %                       Medium, %                    High, %
Neuroticism                                    33.9                          49.6                        16.3
Conscientiousness                              19.3                          55.7                        24.8
Extraversion                                   27.8                          58.1                        13.9
Openness to experience                         10.6                          66.3                        23.1
Agreeableness                                  15.1                          73.3                        11.5
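
The three-level binning described in Section 4.1 can be sketched in a few lines of Python. This is
a minimal illustration rather than the code used in the study: the function names
`score_to_level` and `level_distribution` and the example scores are hypothetical, and only the
interval boundaries (0-20, 21-32, 33-48) come from the text.

```python
from bisect import bisect_left
from collections import Counter

# Interval boundaries taken from the text: low 0-20, medium 21-32, high 33-48.
LEVELS = ("low", "medium", "high")
UPPER_BOUNDS = (20, 32)  # inclusive upper bounds of "low" and "medium"


def score_to_level(score):
    """Map a raw NEO-FFI trait score (0-48) to one of the three levels."""
    if not 0 <= score <= 48:
        raise ValueError(f"NEO-FFI trait score out of range: {score}")
    # bisect_left counts the bounds strictly below the score,
    # so 20 -> index 0 (low), 21 -> index 1 (medium), 33 -> index 2 (high).
    return LEVELS[bisect_left(UPPER_BOUNDS, score)]


def level_distribution(scores):
    """Percentage of users per level, in the style of Table 1."""
    counts = Counter(score_to_level(s) for s in scores)
    return {level: 100.0 * counts[level] / len(scores) for level in LEVELS}


# Made-up scores for illustration only (not data from the study).
example_scores = [12, 20, 21, 27, 32, 33, 40, 25]
print(level_distribution(example_scores))
```

Because the boundaries are stored as inclusive upper bounds, a different partition of the scale
would only require editing `UPPER_BOUNDS`.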

Figure 1 Distribution of levels for users' extraversion score.

    Despite the fact that the medium level covers the shortest score interval, Figure 1 illustrates
that the majority of users fall into this class. The same situation is observed for the other
personality traits. The statistics for each Big Five personality trait are presented in Table 1.

4.2 Features
The format of a Vkontakte personal page provides a wide range of user information. We used gender,
number of friends, number of followers, number of followed groups, number of photos, and number of
audio tracks to form a user's feature set. While filling out a Vkontakte personal page, users can
give their opinion on predefined questions, such as their attitude to smoking or what they consider
most important in people and in life. Our data include all the answers, but it is hard to represent
such information as a feature. Since these questions are not mandatory for Vkontakte users, we
decided to assign them binary values that indicate whether a user provided this information or not.
We assume that these answers can indicate users' general readiness to share their opinion with
other people, which might be valuable for future analysis.
    As mentioned above, we collected users' messages from their public pages. We used the numbers
of likes, comments and reposts related to these messages to calculate their average values per
post. Since for every user we collected messages posted during an equal time period, we can use the
total number of collected posts as a feature. The message timestamps were used to calculate the
proportion of a user's messages posted during night time (12 A.M. to 6 A.M.).

Figure 2 Number of words in users' messages.
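
The activity features described in this section can be illustrated with a short sketch. It is not
the authors' implementation: the record layout (`timestamp`, `likes`, `comments`, `reposts`), the
question keys, and the function name `activity_features` are assumptions made for illustration,
and the night window is assumed to run from midnight to 6 A.M.

```python
from datetime import datetime
from statistics import mean


def activity_features(posts, profile_answers):
    """Build per-user activity features.

    posts: list of dicts with 'timestamp' (datetime), 'likes', 'comments', 'reposts'.
    profile_answers: dict mapping an optional profile question to its answer (or None).
    """
    n_posts = len(posts)
    features = {"n_posts": n_posts}

    # Average number of likes, comments and reposts per post.
    for key in ("likes", "comments", "reposts"):
        features[f"avg_{key}"] = mean(p[key] for p in posts) if posts else 0.0

    # Share of posts published at night (assumed window: 00:00 to 05:59).
    night_posts = sum(1 for p in posts if 0 <= p["timestamp"].hour < 6)
    features["night_share"] = night_posts / n_posts if n_posts else 0.0

    # Binary flags: 1 if the user answered an optional profile question, else 0.
    for question, answer in profile_answers.items():
        features[f"answered_{question}"] = int(bool(answer))
    return features


# Made-up records for illustration only (not data from the study).
posts = [
    {"timestamp": datetime(2017, 3, 1, 1, 30), "likes": 4, "comments": 0, "reposts": 1},
    {"timestamp": datetime(2017, 3, 2, 14, 0), "likes": 2, "comments": 2, "reposts": 0},
    {"timestamp": datetime(2017, 3, 5, 23, 45), "likes": 0, "comments": 1, "reposts": 0},
    {"timestamp": datetime(2017, 3, 9, 5, 59), "likes": 2, "comments": 1, "reposts": 3},
]
answers = {"smoking": "negative", "life_main": None}
features = activity_features(posts, answers)
print(features)
```

Keeping every feature in one flat dictionary makes it straightforward to stack the users into a
feature matrix for a classifier.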


Table 2 Averaged results of multiple 4-fold cross-validation runs on the data.

                                                  Random Forest
           Big Five trait                    Recall, %                    Precision, %                 F1-score, %
Neuroticism                                    49.07                          53.01                       49.51
Conscientiousness                              35.19                          37.12                       35.46
Extraversion                                   46.41                          46.79                       46.38
Openness to experience                         44.65                          47.50                       45.46
Agreeableness                                  51.15                          56.04                       53.15
                                                         SVM
           Big Five trait                    Recall, %                    Precision, %                 F1-score, %
Neuroticism                                    33.88                          49.17                       33.38
Conscientiousness                              37.54                          41.69                       36.17
Extraversion                                   40.02                          47.04                       41.78
Openness to experience                         32.28                          52.47                       35.07
Agreeableness                                  38.10                          57.59                       43.26
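
The macro-averaged recall, precision and F1-score reported in Table 2 can be illustrated with a
minimal pure-Python sketch. It is an assumption-laden illustration, not the evaluation code of the
study: the function name `macro_scores` and the toy labels are hypothetical, and a score with a
zero denominator is set to 0.0, which is one common convention. To our knowledge, scikit-learn's
`precision_recall_fscore_support` with `average="macro"` computes the same quantities.

```python
def macro_scores(y_true, y_pred, labels=("low", "medium", "high")):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Per-class scores; a zero denominator yields 0.0 (assumed convention).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Macro averaging: unweighted mean over classes, so rare classes count
    # as much as the dominant "medium" class.
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels for illustration only (not data from the study).
y_true = ["low", "low", "medium", "medium", "medium", "medium", "high", "high"]
y_pred = ["low", "medium", "medium", "medium", "medium", "medium", "medium", "high"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"precision={macro_p:.4f} recall={macro_r:.4f} f1={macro_f1:.4f}")
```

Because macro F1 is the mean of the per-class F1 values rather than the harmonic mean of macro
precision and macro recall, it can fall below both of them. This would be consistent with rows of
Table 2 such as neuroticism with SVM, where the F1-score is lower than both recall and precision.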


    However, personal pages on Vkontakte provide much less text data than Facebook and Twitter. The
most popular form of Vkontakte user activity is reposting. A large number of communities provide
various kinds of content, and users usually only repost this content on their personal pages
without adding any comments or opinions. Overall, we collected 13152 posts, but the majority of
them were empty reposts. Only 2637 of them contain text written by the users themselves. The total
number of words used by each user is presented in Figure 2.
    As Figure 2 shows, the current data contain a very limited amount of information about the
language of Vkontakte users. Considering this, we decided to perform classification without
language analysis. It is necessary to collect much more data before applying text analysis and
compiling text-based features. In this work, we perform the classification task using mostly social
media activity features.
    Although we ignored lexical features in this research, we processed the message data to form
several additional features. For example, we computed the average number of sentences and words. We
also computed the proportion of uppercase words as well as the number of ellipses in the users'
writings. We assume that the described features could reveal some specifics of people's behavior in
social media.

5 Results of experiments

This section presents the results of our experiments. To perform the evaluations, we used the
scikit-learn implementations of the random forest and multiclass SVM algorithms [17]. The
classifier parameters were selected by grid search with 4-fold cross-validation.
    We calculated the macro-averaged variants of recall, precision, and F1-score to report
classification performance. To evaluate the accuracy of our models, we performed 10 runs of 4-fold
cross-validation on the data. The results of our experiments are presented as the average value
over these runs for each metric. The multiclass classification results with 4-fold cross-validation
are presented in Table 2. The best values for each metric are highlighted in bold.
    The best performance was shown for agreeableness and neuroticism, with F1-scores of 53.15% and
49.51% respectively. Slightly worse results were obtained for extraversion and openness to
experience, with F1-scores of 46.38% and 45.46%. These results were obtained with the random forest
classification algorithm. Performance for the conscientiousness trait was the lowest in our
experiments, with an F1-score of only 36.17%, obtained with SVM. It is worth noting that in most
cases SVM achieved higher precision than random forest, but its recall was significantly lower.
    In general, we cannot describe the obtained performance as good. However, the limited
information about the language use of Vkontakte users prevented us from compiling lexical features
and performing text analysis. According to the results of studies based on English-speaking social
media, text features might serve as an effective tool for revealing users' Big Five personality
traits. Thus, in this study, we mostly tested social media activity features, which we can describe
as being useful for the considered task.

6 Conclusion

In this work, we performed the prediction of Big Five personality traits of social media users. We
collected the results of the NEO-FFI questionnaire taken by 165 volunteers and compiled a dataset
using social media activity information from their personal pages. The
personality trait scores were represented as low, medium, and high levels to transform the task
into multiclass classification.
    We can identify two limitations that we faced during our work. The first is that Vkontakte
users' messages provide a very small amount of text data. We observed that the collected messages
are, for the most part, empty reposts, which do not contain any text written by the users
personally. This limitation imposes some restrictions on our current study. The features for the
classification were compiled by processing social media activity information, without any lexical
features. We assume that lexical features could greatly improve classification results. The second
limitation is a simple lack of examples in our current dataset.
    Considering this limitation, our most important task now is to add many more new examples to
the dataset. With a larger amount of data, we can apply text analysis approaches and investigate
the relation between Big Five personality traits and the language of Russian-speaking social media,
which is currently an unresearched field of study.

Acknowledgments. This work was financially supported by the Ministry of Education and Science of
the Russian Federation. Grant No. 14.604.21.0194 (Unique Project Identifier RFMEFI60417X0194)

References

   [1]  Gosling, S. D., Rentfrow, P. J., & Swann Jr, W. B. (2003). A very brief measure of the
        Big-Five personality domains. Journal of Research in Personality, 37(6), 504-528.
   [2]  Ortigosa, A., Carro, R. M., & Quiroga, J. I. (2014). Predicting user personality by mining
        social interactions in Facebook. Journal of Computer and System Sciences, 80(1), 57-71.
   [3]  Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal,
        M., ... & Ungar, L. H. (2013). Personality, gender, and age in the language of social
        media: The open-vocabulary approach. PloS one, 8(9), e73791.
   [4]  Costa, P. T., & McCrae, R. R. (1989). NEO five-factor inventory (NEO-FFI). Odessa, FL:
        Psychological Assessment Resources.
   [5]  Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., & Mitchell, M. (2015). CLPsych
        2015 shared task: Depression and PTSD on Twitter. In Proceedings of the 2nd Workshop on
        Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical
        Reality (pp. 31-39).
   [6]  Yazdavar, A. H., Al-Olimat, H. S., Ebrahimi, M., Bajaj, G., Banerjee, T., Thirunarayan,
        K., ... & Sheth, A. (2017, July). Semi-Supervised Approach to Monitoring Clinical
        Depressive Symptoms in Social Media. In Proceedings of the 2017 IEEE/ACM International
        Conference on Advances in Social Networks Analysis and Mining 2017 (pp. 1191-1198). ACM.
   [7]  Jamil, Z. (2017). Monitoring Tweets for Depression to Detect At-risk Users (Doctoral
        dissertation, Université d'Ottawa/University of Ottawa).
   [8]  De Choudhury, M., Counts, S., & Horvitz, E. (2013, May). Social media as a measurement
        tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science
        Conference (pp. 47-56). ACM.
   [9]  Wang, X., Zhang, C., Ji, Y., Sun, L., Wu, L., & Bao, Z. (2013, April). A depression
        detection model based on sentiment analysis in micro-blog social network. In Pacific-Asia
        Conference on Knowledge Discovery and Data Mining (pp. 201-213). Springer, Berlin,
        Heidelberg.
   [10] Cobb-Clark, D. A., & Schurer, S. (2012). The stability of big-five personality traits.
        Economics Letters, 115(1), 11-15.
   [11] Golbeck, J., Robles, C., Edmondson, M., & Turner, K. (2011, October). Predicting
        personality from twitter. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE
        Third International Conference on Social Computing (SocialCom), 2011 IEEE Third
        International Conference on (pp. 149-156). IEEE.
   [12] Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word
        count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71(2001), 2001.
   [13] Coltheart, M. (1981). The MRC psycholinguistic database. The Quarterly Journal of
        Experimental Psychology Section A, 33(4), 497-505.
   [14] Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are
        predictable from digital records of human behavior. Proceedings of the National Academy
        of Sciences, 110(15), 5802-5805.
   [15] Shchebetenko, A. (2013). Big Five and usage of the VK online social network. Bulletin of
        South Ural State University, Series "Psychology" (pp. 73-83).
   [16] Costa, P. T., & McCrae, R. R. (1992). Normal personality assessment in clinical practice:
        The NEO Personality Inventory. Psychological Assessment, 4(1), 5.
   [17] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
        Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine
        Learning Research, 12(Oct), 2825-2830.