Empathic inclination from digital footprints⋆

Marco Polignano, Pierpaolo Basile, Gaetano Rossiello, Marco de Gemmis, and Giovanni Semeraro
University of Bari “Aldo Moro”, Dept. of Computer Science
name.surname@uniba.it

Abstract. The large amount of personal data left by users on the Internet is a valuable source of information for improving the efficacy of profiling tasks. In particular, the data collected from social media can disclose personal habits, preferences and affective traits. This study focuses on the empathic inclination of a subject, i.e. the ability to feel and share another person’s emotions, which can be a relevant aspect to consider in retrieval or recommendation processes. To support this idea, we propose a model that predicts the level of empathy and highlights its correlations with explicit features that characterize the user.

Keywords: Social media footprints, Empathy, Machine Learning

1 Background and Motivations

The massive spread of social media over mobile devices has significantly changed the way people communicate today. Interacting with social media allows a person to feed her digital identity with preferences, interests and aptitudes. That information, usually known as social media footprints, is available on the web and can be exploited by others to discover that person’s tendencies, lifestyle, and also affective and psychological traits [3, 7]. For this reason, we want to investigate whether (and how) it is possible to predict the empathy inclination of a user. We believe that personalization systems working in some specific domains, such as movie or music recommendation, would benefit from the knowledge of this affective aspect of the user. According to Hogan [2], empathy can be correlated with social self-confidence, even-temperedness, sensitivity and nonconformity.
Therefore, a subject who shows high empathy is a very emotional and sensitive person: she is not only inclined to understand others’ emotions, but is also able to feel strong emotions for them.

⋆ These results have already been published in “Learning Inclination to Empathy from Social Media Footprints”, in Proceedings of User Modelling, Adaptation and Personalization (UMAP 2017), FIIT STU, Bratislava, Slovakia, July 2017.

2 Empathy Inclination Prediction Model

The proposed model is based on the idea that several aspects of the user’s life might contribute to inferring the user’s inclination to empathy. We exploit several kinds of features, as sketched in Fig. 1, to predict an empathy score with different regression models.

Fig. 1. Empathy prediction model

Each user U_i is represented as the concatenation of five feature vectors, each capturing a particular aspect of the user profile. The user’s preferences are obtained by analyzing her likes on social media, grouped by topic; they include likes for pages, artists, movies and many other topics of interest. Likes are represented with SVD [5], which expresses them through relevant combinations of concepts, and with LDA [1], which expresses them as a combination of descriptive topics. The posts are analyzed by a pipeline that performs basic NLP operations (we adopted TweetNLP as tokenizer: http://www.cs.cmu.edu/~ark/TweetNLP/), annotates emoticons, and removes character repetitions longer than two inside words. To capture the semantics behind the words, we run the word2vec algorithm [6] over all the textual posts in the collection to learn 200-dimensional vectors, considering only words that occur at least 10 times and using 10 training epochs. Moreover, we divide the whole vocabulary of word2vec vectors into clusters, which should represent topics of discussion.
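The post-normalization steps described above (emoticon annotation and removal of character repetitions longer than two) can be illustrated with a minimal sketch. It is not the authors' pipeline: whitespace splitting stands in for the TweetNLP tokenizer, and the emoticon table, the tag names, and the function names `squeeze_repeats` and `normalize_post` are assumptions introduced only for illustration.

```python
import re
from typing import List

# Hypothetical emoticon table: the paper does not list which emoticons
# are annotated or which tags are used.
EMOTICONS = {
    ":)": "EMO_SMILE",
    ":-)": "EMO_SMILE",
    ":(": "EMO_SAD",
    ":-(": "EMO_SAD",
    ";)": "EMO_WINK",
    ":D": "EMO_LAUGH",
}

# Matches a character followed by two or more copies of itself,
# i.e. a run of the same character of length three or more.
_REPEAT = re.compile(r"(.)\1{2,}")


def squeeze_repeats(token: str) -> str:
    """Shorten every run of one character longer than two to exactly two."""
    return _REPEAT.sub(r"\1\1", token)


def normalize_post(text: str) -> List[str]:
    """Split on whitespace (a stand-in for TweetNLP), tag emoticons,
    lowercase the remaining tokens and squeeze character repetitions."""
    tokens = []
    for raw in text.split():
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
        else:
            tokens.append(squeeze_repeats(raw.lower()))
    return tokens


print(normalize_post("Sooooo happyyy today :)"))
```

Squeezing runs to two characters rather than one keeps a trace of the emphasis ("soo" versus "so") while still mapping the many spellings of an elongated word to a single vocabulary entry, which helps when only words occurring at least 10 times are kept.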
3 Experimental Session

The aim of the experiment is to predict the user’s empathy by exploiting information explicitly available on her Facebook profile, as well as implicit information that can be inferred, as explained in Sec. 2. Moreover, we want to identify which groups of features are most important for obtaining an accurate prediction, by discovering relevant correlations between empathy and the user’s features. More precisely, we formulated the following research questions:

– RQ1. Is it possible to predict empathy from social media footprints?
– RQ2. What are the most important features to consider for improving the prediction accuracy?

The dataset used in the experiment, proposed by Kosinski [4], contains information about 4 million Facebook users. Data were collected using the “myPersonality” Facebook application. We removed those users who had not completed the questionnaire or who could not be linked to other data (Demographic, Personality Traits, Activity, Status). After this step, the dataset is composed of 903 users and 178,766 status updates. The range of the empathy value is 0-80.

We exploit three different regression algorithms: 1) Linear Regression (Lr), 2) Simple Regression (Sr), and 3) SVM Regression with the SMO algorithm (SMO) under different kernel configurations. For SMO we used the polynomial kernel (SMO_poly) and the Radial Basis Function (RBF) kernel (SMO_rbf), varying the c parameter from 1 to 8. We propose two simple baselines in order to compare the proposed approach with alternative options. The former always predicts the most frequent value in the dataset (Majority, Value Predicted = 8, MAE = 7.4784, RMSE = 10.8258), while the latter computes the empathy score as the simple average of the EQS observed in the dataset (Avg EQS, Value Predicted = 13.9169, MAE = 6.8457, RMSE = 9.0757).
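The two constant baselines, together with the MAE and RMSE values reported for them, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical and `toy_scores` is invented data, not taken from the myPersonality dataset.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def majority_value(scores: Sequence[float]) -> float:
    """Majority baseline: the most frequent score."""
    return Counter(scores).most_common(1)[0][0]


def average_value(scores: Sequence[float]) -> float:
    """Average baseline: the arithmetic mean of the scores."""
    return sum(scores) / len(scores)


def evaluate_constant(value: float, scores: Sequence[float]) -> Tuple[float, float]:
    """Score a predictor that returns `value` for every user."""
    preds = [value] * len(scores)
    return mae(scores, preds), rmse(scores, preds)


# Toy empathy scores, invented for illustration only.
toy_scores = [8, 8, 12, 20, 30]
for name, value in [("Majority", majority_value(toy_scores)),
                    ("Avg", average_value(toy_scores))]:
    m, r = evaluate_constant(value, toy_scores)
    print(f"{name}: predicted={value:.4f} MAE={m:.4f} RMSE={r:.4f}")
```

Because the mean minimizes the squared error among all constant predictions, the average baseline always has an RMSE no higher than the majority baseline on the same data, which matches the ordering of the two reported RMSE values.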
As for the evaluation metrics, we adopted the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The evaluation protocol was 10-fold cross-validation.

4 Discussion of Results

We ran a first experiment with Lr, Sr, and SMO using all the features of the dataset (1088 features in total). Comparing the results in Tab. 1 with our baselines, we observe that SMO with a polynomial kernel is not a good choice when the number of features is large. On the contrary, SMO with an RBF kernel is able to outperform both baselines by setting c = 1 (MAE = 5.9101, RMSE = 8.2341). These results allow us to answer RQ1 positively. Interesting results are obtained by Sr: its MAE and RMSE are better than the baselines, even though this algorithm creates a regression function considering only the feature with the highest variance in the dataset. Due to these findings, we decided to perform feature selection. We exploit correlation-based feature subset selection to find the set of “most informative” features for the prediction task. The selected features are those with high correlation with the prediction class and low correlation among themselves. We obtained a set of 37 features.

Table 1. Relevant results of empathy level prediction

                 All Features          Filtered Features
Approach    c    MAE       RMSE        MAE       RMSE
SMO_poly    1    12.7137   19.1565     5.714     7.8407
SMO_rbf     2    5.9543    8.2432      5.6673    7.8631
SMO_rbf     8    6.539     8.7748      5.686     7.8236
Lr          -    22.7929   34.4679     5.7854    7.7269
Sr          -    6.1045    8.233       6.1045    8.233

The best result in terms of MAE (5.6673) is obtained by SMO_rbf with c = 2. This configuration does not provide the best RMSE (7.8236), which is achieved by SMO_rbf with c = 8. For the SMO_poly configuration, the best result for both MAE and RMSE is obtained with c = 1 (5.714, 7.8407). It is interesting to note that the results obtained by exploiting only the selected features are better than both the baselines and the runs over the whole set of features.
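The selection criterion described above (high correlation with the prediction class, low correlation among the features) can be illustrated with the merit heuristic commonly associated with correlation-based feature selection, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below is an assumption-laden illustration, not the tool or search strategy used in the experiments: it uses absolute Pearson correlation and a simple greedy forward search, and all function names and data are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(subset: List[str], features: Dict[str, Sequence[float]],
              target: Sequence[float]) -> float:
    """Merit of a subset: rewards correlation with the target and
    penalizes redundancy between the selected features."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(features: Dict[str, Sequence[float]],
               target: Sequence[float]) -> List[str]:
    """Greedy forward search: add the feature that raises the merit most,
    stop as soon as no candidate improves it."""
    selected: List[str] = []
    best = 0.0
    remaining = list(features)
    while remaining:
        merit, name = max((cfs_merit(selected + [f], features, target), f)
                          for f in remaining)
        if merit <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected


# Invented toy data: "dup" is a rescaled copy of "good", "noise" is weakly related.
toy_features = {
    "good": [1, 2, 3, 4, 5],
    "dup": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
}
toy_target = [1.1, 1.9, 3.2, 3.9, 5.1]
print(greedy_cfs(toy_features, toy_target))
```

In the toy data, adding the redundant copy does not raise the merit beyond that of a single informative feature, which is the behaviour that lets this kind of criterion shrink a large feature set to a small, non-redundant one.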
Analyzing the features that emerged from the selection process, we note some interesting correlations between their semantics and the empathy inclination of the user. In particular, we observed that for an accurate prediction we have to consider the user’s religion (Nonreligious/Atheist), country (AG, EG, KW, HN, AR, SR), relationship_status (Separated), personality (extroversion, agreeableness) and some relevant word2vec clusters:

– cluster_1: game, team, soccer, battle, race, fans, bowling;
– cluster_13: dear, cheers, goody, extraordinaire, excitedly;
– cluster_21: personality, motivation, destiny, ability, vision;
– cluster_24: facebook, phone, message, internet, video.

These correlations can be used as hints for user profiling and partially answer RQ2; we therefore decided to perform an ablation analysis for further investigation. We selected the best configuration, SMO_rbf with c = 1, and removed one set of features at a time. By removing groups of features such as demographic, activity, or LDA, we observed only a slight change in MAE and RMSE. On the contrary, by removing the set of features about personality, a significant increase of both MAE (9.6308) and RMSE (9.0815) is observed. This provides a more specific answer to RQ2: personality traits are the key to effective empathy prediction.

5 Conclusion

In this paper, we investigated the problem of mining social media footprints to infer the user’s inclination toward empathy. The main outcome of the experiments is that a strong correlation is observed between empathy and personality traits. As future work, we plan to include the findings described in this preliminary study as part of the user profile and to use them in a recommendation strategy.

References

1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
2. Hogan, R.: Development of an empathy scale. Journal of Consulting and Clinical Psychology 33(3), 307 (1969)
3. Jelenchick, L.A., Eickhoff, J.C., Moreno, M.A.: Facebook depression? Social networking site use and depression in older adolescents. Journal of Adolescent Health 52(1), 128–130 (2013)
4. Kosinski, M., Matz, S.C., Gosling, S.D., Popov, V., Stillwell, D.: Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines. American Psychologist 70(6), 543 (2015)
5. Landauer, T.K.: Latent semantic analysis. Wiley Online Library (2006)
6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
7. Skowron, M., Tkalčič, M., Ferwerda, B., Schedl, M.: Fusing social media cues: Personality prediction from Twitter and Instagram. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 107–108. International World Wide Web Conferences Steering Committee (2016)