
Empathic inclination from digital footprints

Marco Polignano

Pierpaolo Basile

Gaetano Rossiello

Marco de Gemmis

Giovanni Semeraro

Empathy Inclination Prediction Model

University of Bari “Aldo Moro”, Dept. of Computer Science

The large amount of personal data left by users on the Internet is a valuable source of information for improving the efficacy of profiling tasks. In particular, the data collected from social media can disclose personal habits, preferences and affective traits. This study focuses on the empathic inclination of a subject, i.e. the ability to feel and share another person's emotions, which can be a relevant aspect to consider in retrieval or recommendation processes. To support this idea, we propose a model that predicts the level of empathy and highlights its correlations with explicit features that characterize the user.

Keywords: Social media footprint, Empathy, Machine Learning

of features, as sketched in Fig. , to predict an empathy score by different linear regression models.

Each user is represented as the concatenation of five feature vectors, each capturing a particular aspect of the user profile. The user’s preferences are obtained by analyzing her likes on social media, grouped by topic; they include likes for pages, artists, movies and many other topics of interest. The likes are represented using SVD [ ], which yields relevant combinations of concepts, and LDA [ ], which yields a combination of descriptive topics. The posts are analyzed by a pipeline that performs basic NLP operations (we adopted TweetNLP as tokenizer: http://www.cs.cmu.edu/~ark/TweetNLP/), annotates emoticons, and removes character repetitions longer than two inside words. To capture the semantics behind the words, we run the word2vec algorithm [ ] over all the textual posts in the collection to learn -dimension vectors, considering only words that occur at least 10 times and training for 10 epochs. Moreover, we divide the whole vocabulary of word2vec vectors into clusters, which should represent topics of discussion.
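The post normalization step described above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: a whitespace split stands in for the TweetNLP tokenizer, the emoticon table and its labels are hypothetical, and a run of three or more identical characters is collapsed to exactly two (the text only says that repetitions longer than two are removed).

```python
import re

# Assumption: a run of 3+ identical characters is collapsed to exactly two.
_REPEAT = re.compile(r"(.)\1{2,}")

# Hypothetical emoticon table; the labels are illustrative only.
_EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":(": "EMO_NEG", ":-(": "EMO_NEG"}


def normalize_post(text):
    """Tokenize a post, annotate emoticons, and shorten character repetitions."""
    tokens = []
    for tok in text.split():  # whitespace split stands in for TweetNLP
        if tok in _EMOTICONS:
            tokens.append(_EMOTICONS[tok])
        else:
            tokens.append(_REPEAT.sub(r"\1\1", tok.lower()))
    return tokens


print(normalize_post("Sooooo happyyyy :)"))
```

Collapsing to two characters rather than one keeps a trace of the emphasis ("soo" versus "so") while still mapping the many spelling variants to a single vocabulary entry, which matters when words must occur at least 10 times to receive a vector.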

Experimental Session

The aim of the experiment is to predict the user’s empathy by exploiting information explicitly available on her Facebook profile, as well as implicit information that can be inferred, as explained in Sec. . Moreover, we want to identify which groups of features matter most for an accurate prediction, by discovering relevant correlations between empathy and the user’s features.

More precisely, we formulated the following research questions:
– RQ1. Is it possible to predict empathy from social media footprints?
– RQ2. What are the most important features to consider for improving the prediction accuracy?

The dataset used in the experiment, proposed by Kosinski [ ], contains information about 4 million Facebook users. The data were collected using the “myPersonality” Facebook application. We removed the users who had not completed the questionnaire or who could not be linked to the other data (Demographic, Personality Traits, Activity, Status); after this step, the dataset consists of 903 users and 178,766 status updates. The range of the empathy value is - . We exploit three different regression algorithms: 1) Linear Regression, 2) Simple Regression, and 3) SVM Regression with the SMO algorithm, in different kernel configurations. For the we used the polynomial kernel ( . ), while the latter computes the empathy score as the simple average of EQS observed in the dataset (Avg EQS, Value Predicted= , MAE= . , RMSE= . ). As evaluation metrics, we adopted the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The evaluation protocol was -fold cross validation.
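As an illustration of the evaluation setup, the sketch below computes MAE and RMSE and evaluates an average-score baseline under k-fold cross validation. The number of folds `k`, the interleaved fold assignment, and the pooling of errors across folds are assumptions made for this example; the text does not specify them, and the function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_baseline_cv(scores, k):
    """Predict the mean of the training folds for every held-out score.

    Assumption: folds are formed by interleaving (every k-th item) and
    the errors of all folds are pooled before computing the metrics.
    """
    folds = [scores[i::k] for i in range(k)]
    y_true, y_pred = [], []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        mean = sum(train) / len(train)
        y_true.extend(held_out)
        y_pred.extend([mean] * len(held_out))
    return mae(y_true, y_pred), rmse(y_true, y_pred)


# Toy scores, not taken from the dataset.
print(mean_baseline_cv([1.0, 2.0, 3.0, 4.0], k=2))
```

Because RMSE squares the errors before averaging, it penalizes large individual errors more than MAE does, and for any set of predictions MAE is never larger than RMSE.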

Discussion of Results

We execute a first experiment by running Linear Regression, Simple Regression, and SVM Regression using all the features of the dataset ( features in total). We compared the results in with our baselines, observing that SVM Regression with a polynomial kernel is not a good choice when the number of features is large. On the contrary, SVM Regression with an RBF kernel outperforms both baselines by setting = 1 (MAE = 5.9101, RMSE = 8.2341). These results allow us to answer RQ1 positively. Interesting results are obtained by Simple Regression: its MAE and RMSE are better than the baselines, even though this algorithm builds a regression function considering only the feature with higher variance in the dataset.

Given these findings, we decided to perform feature selection. We exploit correlation-based feature subset selection to find the set of “most informative” features for the prediction task. The selected features are those with high correlation with the prediction class and low correlation among themselves. We obtained a set of 37 features. The best result in terms of MAE (5.6673) is obtained by the , with = 2. This configuration does not provide the best RMSE (7.8236), which is achieved by with = 8. For the configuration, the best result for both MAE and RMSE is obtained with = 1 (5.714, 7.8407). It is interesting to note that the results obtained by exploiting only the selected features are better than both the baselines and the runs over the whole set of features.

Analyzing the features that emerged from the selection process, we can note some interesting correlations between their semantics and the empathy inclination of the user. In particular, we observed that for an accurate prediction we have to consider the user’s religion (Nonreligious/Atheist), country (AG, EG, KW, HN, AR, SR), relationship_status (Separated), personality (extroversion, agreeableness) and some relevant word2vec clusters: cluster_ : game, team, soccer, battle, race, fans, bowling; cluster_ : dear, cheers, goody, extraordinaire, excitedly; cluster_ : personality, motivation, destiny, ability, vision; cluster_ : facebook, phone, message, internet, video. These correlations can be used as hints for user profiling and partially answer RQ2.

We therefore decided to perform an ablation analysis for further investigation. We selected the best configuration with = 1 and removed one set of features at a time. By removing groups of features such as demographic, activity, or LDA, we observed only a slight change of MAE and RMSE. On the contrary, by removing the set of features about personality, a significant increase of both MAE (9.6308) and RMSE (9.0815) is observed. This provides a more specific answer to RQ2: personality traits are the key to effective empathy prediction.
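The idea behind correlation-based feature subset selection can be sketched as follows. This is a simplified illustration, not the procedure actually used in the experiments: it scores a subset of k features with the standard merit k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean absolute feature-target correlation and r_ff the mean absolute feature-feature correlation, and it assumes Pearson correlation and a greedy forward search. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def merit(subset, columns, target):
    """CFS merit: rewards correlation with the target, penalizes redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (
        sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(columns, target):
    """Greedy forward search (an assumption): add the feature that most
    improves the merit, and stop when no candidate improves it."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        score, feat = max(
            (merit(selected + [f], columns, target), f) for f in sorted(remaining)
        )
        if score <= best:
            break
        selected.append(feat)
        best = score
        remaining.remove(feat)
    return selected


# Toy data: "b" is nearly redundant with "a"; "noise" is uncorrelated with the target.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [1, 2, 3, 4, 6],
    "noise": [2, -1, 0, -1, 2],
}
target = [1, 2, 3, 4, 5]
print(cfs_forward(columns, target))
```

In the toy data the redundant feature is rejected even though it correlates strongly with the target, because adding it raises the feature-feature correlation more than it raises the feature-target correlation.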

Conclusion

In this paper, we investigated the problem of mining social media footprints to infer the user’s inclination toward empathy. The main outcome of the experiments is that a strong correlation is observed between empathy and personality traits. As future work, we plan to include the findings of this preliminary study in the user profile and to exploit them in a recommendation strategy.