Analysis of Big Five Personality Traits by Processing of Social Media Users Activity Features

© Maxim Stankevich © Ivan Smirnov
Institute for Systems Analysis, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia
RUDN University, Moscow, Russia
stankevich@isa.ru ivs@isa.ru

© Nikolay Ignatiev
RUDN University, Moscow, Russia
naignatiev@yandex.com

© Oleg Grigoriev
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia
oleggpolikvart@yandex.ru

© Natalia Kiselnikova
Psychological Institute of Russian Academy of Education, Moscow, Russia
nv.pirao@gmail.com

Abstract. The study focuses on the relation between the Big Five personality traits of a user and their activity on the popular Russian social network Vkontakte. To obtain Big Five personality trait scores, we asked Vkontakte users to complete a psychological survey and then analyzed data from their public social media pages. The purpose of the study was to investigate the relation between social media activity features and users’ levels of neuroticism, conscientiousness, extraversion, openness to experience and agreeableness. To perform the task, we used machine learning classification algorithms.

Keywords: social media analysis, big five personality traits, machine learning, classification.

1 Introduction

The Big Five personality traits model is a popular psychological tool, commonly used for describing the human personality through the following measurements: neuroticism, conscientiousness, extraversion, openness to experience and agreeableness [1]. Personality trait scores are usually calculated with the help of questionnaires. The widespread use of social media makes it possible to obtain information about social media users by analyzing data retrieved from their public pages. However, there are only a few studies related to the analysis of users’ Big Five personality traits by using social media activity information from Russian-speaking social networks. A number of researchers are involved in Big Five personality traits prediction and analysis for English-speaking social networks [2,3], but there are no in-depth studies for Russian. The proposed approach and the collected dataset are new for Russian social network analysis. The purpose of the study was to investigate the relation between social media activity features and a user’s level of neuroticism, conscientiousness, extraversion, openness to experience and agreeableness by using machine learning algorithms.

In order to form the dataset, we asked volunteers to complete the NEO-FFI questionnaire [4] and then to provide access to the information on their public pages under privacy constraints. Thus, we received data of 165 users from the popular Russian social network Vkontakte. We presented the five personality trait scores on the following scale: low, medium, and high level. The idea is to present the problem as multiclass classification. Classification features are based on a 1-year period of user activity, represented as posts on their public pages, and on general information about users’ profiles such as gender and the total number of friends and followers. To evaluate the methods, we ran two sets of experiments with different classifiers: support vector machine and random forest.

The main issue we faced was a lack of training examples. Although insufficient data prevented us from significantly improving classification performance, we came to the conclusion that the feature format should be redesigned and text analysis-based features should be added. We continue data collection and look forward to improving our results in the near future.
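The three-level scale mentioned above can be illustrated with a short sketch. It assumes the cut-offs given in Section 4.1 (0-20 low, 21-32 medium, 33-48 high); the function name `neo_ffi_level` and the sample scores are hypothetical and are not taken from the authors' code.

```python
def neo_ffi_level(score: int) -> str:
    """Map a raw NEO-FFI trait score (0-48) to one of three class labels.

    Cut-offs follow Section 4.1: 0-20 low, 21-32 medium, 33-48 high.
    """
    if not 0 <= score <= 48:
        raise ValueError(f"NEO-FFI trait score must be in 0..48, got {score}")
    if score <= 20:
        return "low"
    if score <= 32:
        return "medium"
    return "high"


# Hypothetical raw scores for one user (illustration only, not real data).
sample_scores = {
    "neuroticism": 18,
    "conscientiousness": 27,
    "extraversion": 35,
    "openness": 32,
    "agreeableness": 21,
}

# One class label per trait; each trait becomes a separate 3-class problem.
labels = {trait: neo_ffi_level(score) for trait, score in sample_scores.items()}
print(labels)
```

Each of the five traits is binned independently, so a single user contributes one label to each of five separate three-class classification problems.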
Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

2 Related works

There are many studies that investigate the use of social media-based data for classification in different psychology-related tasks.

Besides Big Five personality analysis, detection of depression, post-traumatic stress disorder and anxiety is also a very important problem. For example, the CLPsych 2015 Shared Task organizers built a dataset consisting of message collections of depressed and non-depressed users and asked contributors to share the performance of their depression detection models [5]. This shared task, as well as other similar studies such as [6] and [7], applied natural language processing methods to textual data to form features for a predictive model. The authors of related works [8] and [9] used social media activity features to improve classification performance. One should take into account that the depression detection task is time-dependent: it is necessary to consider time constraints during dataset preparation, whereas Big Five personality traits are more consistent in time [10].

One of the most significant studies related to social media language and Big Five personality traits is presented in [3]. The authors analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers who took a standard personality test. The work demonstrates some important dependencies between language use and users’ personality attributes. For each personality trait, they formed a list of related words that showed valuable correlations with the levels of neuroticism, conscientiousness, extraversion, openness to experience and agreeableness.

The work presented in [11] describes Big Five personality traits prediction models for Twitter users. The dataset of this research contains the most recent 2000 tweets of 279 volunteers. To perform the task, the authors decided to present personality trait scores as values on a normalized 0-1 scale. The features used were based on text analysis. The authors utilized the Linguistic Inquiry and Word Count tool [12] to produce statistics on 81 different features. The MRC Psycholinguistic Database [13] was used to retrieve features from users’ vocabulary. The authors also formed social media activity features. The results of correlation analysis revealed that some of them correlated with the five-factor personality model. The proposed models showed a mean absolute error of about 15% on a normalized scale for each personality trait as a measurement of model prediction accuracy.

Another work related to the task of Big Five traits prediction is described in [14]. The data for the research contains information about the likes of 58,466 volunteers from the Facebook social network. The authors used a decomposed User-Likes matrix with logistic and linear regression models to predict users’ Big Five personality traits and other personal attributes. The model achieved high prediction performance for personal attributes such as gender, age, and nationality (~80% Area Under Curve) and an accuracy score of about 35% for users’ Big Five personality traits.

The research presented in [15] has an idea similar to ours. The authors collected users’ data from Vkontakte and performed correlation analysis using social media activity indicators. The main interest was in photos published on the users’ public pages. According to the results, the most significant correlations were found between extraversion and such activity indicators as the number of friends and followers, the total number of posts and some photo information-based indicators. The neuroticism score also showed a valuable positive correlation with a user’s total number of posts.

We analyzed related works and came to the following conclusion. The background studies propose valuable methodologies for Big Five personality traits analysis and prediction, which are mainly related to the language use of English-speaking social media users. For Russian-speaking social networks this problem is not well studied. For example, in 2007 the myPersonality project (mypersonality.org) started to gather social media data and results of psychology questionnaires from Facebook users. The huge volume of data from this project was successfully used in different academic studies. However, there are no available and appropriate datasets based on Russian-speaking social media. This is the main reason why we had to form our own original dataset for the task of Big Five personality traits analysis of Vkontakte users.

3 Dataset

To build the dataset, we asked volunteers from Vkontakte to take part in a psychological survey and complete the NEO-FFI questionnaire. After this part, we requested access to their public pages under privacy constraints. Finally, for those who gave their consent and completed the questionnaire, we collected all available information from their public profile pages. Overall, data from 165 profiles was assembled. Personal information that could reveal the identity of a person was removed from the data.

We divided the collected data into two categories: general information about users and information about user messages posted during the time period from January 2017. The first part contains such features as the number of friends, number of followers, gender, number of followed groups and communities, etc. The second part contains the text of the users’ messages, timestamps, and the numbers of likes, comments, and reposts (an analog of a retweet on Twitter).

It is worth mentioning that we continue to expand our dataset with new examples. This study is based on the current amount of available data, but we consider this number only as an intermediate stage.

Table 1 Big Five personality traits label distribution
among users in the dataset.

Trait                     Low, %   Medium, %   High, %
Neuroticism               33.9     49.6        16.3
Conscientiousness         19.3     55.7        24.8
Extraversion              27.8     58.1        13.9
Openness to experience    10.6     66.3        23.1
Agreeableness             15.1     73.3        11.5

4 Methods

4.1 Big five personality traits

Here we describe the methodology for Big Five personality traits prediction. We also describe the representation of personality scores and the features that we extract from the available data.

As a first step, we divide the initial NEO-FFI score scale (0-48) of each personality trait as follows: low level (0-20), medium level (21-32) and high level (33-48) [16]. As a result, one of these three classes was assigned to each of a user’s scores of neuroticism, conscientiousness, extraversion, openness to experience and agreeableness. Thus, the initial task is transformed into the task of multiclass classification. It should be noted that such an approach imposes some restrictions on the evaluation method. Figure 1 represents the class distribution among users’ levels of extraversion.

Figure 1 Levels distribution for users’ extraversion score.

Despite the fact that the medium level covers the shortest score interval, Figure 1 illustrates that the majority of users fall into this class. The same situation is observed with the other personality traits. The statistics for each Big Five personality trait are presented in Table 1.

4.2 Features

The format of a Vkontakte personal page provides a wide range of user information. We used gender, number of friends, number of followers, number of followed groups, number of photos, and number of audio tracks to form a user’s feature set. While filling out a Vkontakte personal page, users can provide their opinion on predefined questions, such as how they relate to smoking or what is the most important in people and life. Our data include all the answers, but it is hard to present such information as a feature. Since these questions are not mandatory for Vkontakte users, we decided to assign them binary values that represent whether a user provided this information or not. We assume that these answers can indicate users’ general readiness to share their opinion with other people and that this might be valuable for future analysis.

As mentioned, we collected users’ messages from their public pages. We used information about likes, comments and reposts related to these messages to calculate their averaged values per post. The fact that for every user we collected messages posted during an equal time period allows us to use the total number of assembled posts as a feature. The message timestamps were used to calculate the proportion of users’ messages posted during night time (12 A.M. to 6 A.M.).

Figure 2 Number of words in users’ messages.

However, Vkontakte personal pages provide much less text data than Facebook and Twitter. The most popular format of Vkontakte users’ activity is reposting. A large number of communities provide different kinds of content, and users usually only repost this content on their personal pages without giving any comments or opinions. Overall, we collected 13152 posts, but the majority of them were empty reposts. Only 2637 of them contain texts written by the users themselves. The total number of words used by each user is presented in Figure 2.

As we can see in Figure 2, the current data contains a very limited amount of information about Vkontakte language. Considering this, we decided to perform classification without language analysis. It is necessary to collect much more data before applying text analysis and compiling text-based features. In this work, we perform the classification task using mostly social media activity features.

Although we ignored lexical features in this research, we processed the message data to form several additional features, for example, the average number of sentences and words. We also computed the proportion of uppercase words as well as the number of ellipses in the users’ writings. We assume that the described features could reveal some specifics of people’s behavior in social media.

5 Results of experiments

This section presents the results of our experiments. To perform the evaluations, we used the scikit-learn implementation of the random forest and multiclass SVM algorithms [17]. The parameters for the classification were set up by grid search with 4-fold cross-validation.

We calculated the macro variation of recall, precision, and f1-score to present classification performance. To evaluate the accuracy of our models we performed 10 runs of 4-fold cross-validation on the data. The results of our experiments are presented as an averaged value of these runs for each metric. The multiclass classification results with 4-fold cross-validation are presented in Table 2. The best values for each metric are highlighted in bold.

Table 2 Averaged results of multiple 4-fold cross-validation runs on the data.

Random Forest
Big Five trait            Recall, %   Precision, %   F1-score, %
Neuroticism               49.07       53.01          49.51
Conscientiousness         35.19       37.12          35.46
Extraversion              46.41       46.79          46.38
Openness to experience    44.65       47.50          45.46
Agreeableness             51.15       56.04          53.15

SVM
Big Five trait            Recall, %   Precision, %   F1-score, %
Neuroticism               33.88       49.17          33.38
Conscientiousness         37.54       41.69          36.17
Extraversion              40.02       47.04          41.78
Openness to experience    32.28       52.47          35.07
Agreeableness             38.10       57.59          43.26

The best performance was shown for agreeableness and neuroticism, with f1-scores of 53.15% and 49.51% respectively. Slightly worse results were received for extraversion and openness to experience, with f1-scores of 46.38% and 45.46%. The random forest classification algorithm was used to get these results. The performance for the conscientiousness trait was the lowest in our experiments, with an f1-score of only 36.17%, received by SVM. It is worth noting that in most cases SVM achieved higher precision than RF, but its recall score was significantly lower.

In general, we cannot describe the observed performance as good. However, the limited information about the language use of Vkontakte users prevented us from compiling lexical features and performing text analysis. According to the results of studies based on English-speaking social media, text features might serve as an effective tool for revealing users’ Big Five personality traits. Thus, in this study, we mostly tested social media activity features, which we can describe as being useful for the considered task.

6 Conclusion

In this work, we performed the prediction of Big Five personality traits of social media users. We collected the results of the NEO-FFI questionnaire taken by 165 volunteers and compiled a dataset using social media activity information from their personal pages. The personality trait scores were represented as low, medium, and high levels to transform the task into multiclass classification.

We can define two limitations that we faced during our work. The first one is that Vkontakte users’ messages provide a very small amount of text data. We observed that the collected messages, for the most part, are empty reposts, which do not provide any text written by the users personally. This limitation imposes some restrictions on our current study. The features for the classification were compiled by processing social media activity information without any lexical features. We assume that lexical features can greatly improve classification results. The second limitation is a simple lack of examples in our current dataset.

Considering this limitation, our most important task now is to add many more new examples to the dataset. With a greater amount of data, we can utilize text analysis approaches and investigate the relation between Big Five personality traits and Russian-speaking social media language, which is currently an unresearched field of study.

Acknowledgments. This work was financially supported by the Ministry of Education and Science of the Russian Federation. Grant No. 14.604.21.0194 (Unique Project Identifier RFMEFI60417X0194)

References

[1] Gosling, S. D., Rentfrow, P. J., & Swann Jr, W. B. (2003). A very brief measure of the Big-Five personality domains. Journal of Research in personality, 37(6), 504-528.
[2] Ortigosa, A., Carro, R. M., & Quiroga, J. I. (2014). Predicting user personality by mining social interactions in Facebook. Journal of computer and System Sciences, 80(1), 57-71.
[3] Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., ... & Ungar, L. H. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 8(9), e73791.
[4] Costa, P. T., & McCrae, R. R. (1989). NEO five-factor inventory (NEO-FFI). Odessa, FL: Psychological Assessment Resources.
[5] Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., & Mitchell, M. (2015). CLPsych 2015 shared task: Depression and PTSD on Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 31-39).
[6] Yazdavar, A. H., Al-Olimat, H. S., Ebrahimi, M., Bajaj, G., Banerjee, T., Thirunarayan, K., ... & Sheth, A. (2017, July). Semi-Supervised Approach to Monitoring Clinical Depressive Symptoms in Social Media. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017 (pp. 1191-1198). ACM.
[7] Jamil, Z. (2017). Monitoring Tweets for Depression to Detect At-risk Users (Doctoral dissertation, Université d'Ottawa/University of Ottawa).
[8] De Choudhury, M., Counts, S., & Horvitz, E. (2013, May). Social media as a measurement tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science Conference (pp. 47-56). ACM.
[9] Wang, X., Zhang, C., Ji, Y., Sun, L., Wu, L., & Bao, Z. (2013, April). A depression detection model based on sentiment analysis in micro-blog social network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 201-213). Springer, Berlin, Heidelberg.
[10] Cobb-Clark, D. A., & Schurer, S. (2012). The stability of big-five personality traits. Economics Letters, 115(1), 11-15.
[11] Golbeck, J., Robles, C., Edmondson, M., & Turner, K. (2011, October). Predicting personality from twitter. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on (pp. 149-156). IEEE.
[12] Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71(2001), 2001.
[13] Coltheart, M. (1981). The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A, 33(4), 497-505.
[14] Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802-5805.
[15] Shchebetenko, A. (2013). Big Five and usage of the VK online social network. Bulletin of South Ural State University, Series “Psychology” (pp. 73-83).
[16] Costa, P. T., & McCrae, R. R. (1992). Normal personality assessment in clinical practice: The NEO Personality Inventory. Psychological assessment, 4(1), 5.
[17] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830.
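As a supplementary illustration, the macro-averaged recall, precision and f1-score reported in Section 5 can be sketched in plain Python. This is not the scikit-learn code used for the experiments: the function name `macro_scores` and the toy label lists are hypothetical, and the sketch assumes that each metric is computed per class and then averaged with equal class weights, which is how scikit-learn's `average="macro"` option behaves.

```python
from typing import Dict, Sequence

LEVELS = ("low", "medium", "high")


def macro_scores(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = LEVELS,
) -> Dict[str, float]:
    """Compute macro-averaged recall, precision and F1 over the given labels."""
    recalls, precisions, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no true examples) count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    n = len(labels)
    return {
        "recall": sum(recalls) / n,
        "precision": sum(precisions) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for six users (illustration only, not data from the study).
y_true = ["low", "medium", "medium", "high", "medium", "low"]
y_pred = ["low", "medium", "medium", "medium", "medium", "high"]
scores = macro_scores(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

Because every class carries the same weight, a classifier that mostly predicts the dominant medium class is penalized for missing the low and high classes, which matters for label distributions as skewed as those in Table 1.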