What do your look-alikes say about you? Exploiting strong and weak similarities for author profiling
Notebook for PAN at CLEF 2015

Piotr Przybyła and Paweł Teisseyre
Institute of Computer Science, Polish Academy of Sciences
Jana Kazimierza 5, 01-248 Warsaw, Poland
p.przybyla@phd.ipipan.waw.pl, teisseyrep@ipipan.waw.pl

Abstract. We describe a two-step procedure for author profiling, which first exploits language similarities between users and then aims at discovering more complex dependencies for dissimilar users. The method is motivated by the observation that authors using very similar vocabulary are likely to have similar traits. We use both word-based and text-based features, as well as resources from previous research. The proposed approach gives successful results, especially for gender and age prediction. Moreover, we identify the most useful features using relevance measures based on random forests.

1 Introduction

This paper outlines our approach to the author profiling task at the 13th PAN evaluation lab on uncovering plagiarism, authorship, and social software misuse [8]. The goal is to analyse a collection of tweets (in English, Spanish, Dutch and Italian) and discover its author's gender, age and personality traits: extraversion, stability, agreeableness, conscientiousness and openness. Unfortunately, the amount of available training data is very small: from 34 users for Dutch to 152 for English. As it seems very unlikely to discover new significant dependencies in such small sets, we decided to generate features based on a collection of lexicons obtained in previous works. Moreover, we observed that authors using very similar vocabulary (the look-alikes) tend to have identical traits. We exploit this fact by performing a two-step prediction procedure: classifying a new item starts with finding a close neighbour; a full prediction model is used only when nothing close enough can be found.
2 Features

In our approach, two groups of features are used: word-based and text-based. The word-based features represent the numbers of occurrences of lemmas obtained with the multi-language TreeTagger [10]. The text-based features, computed as global statistics of a user's text, include the following:

– length – average tweet length (number of characters),
– wordLength – average word length,
– urls – average number of URLs per tweet¹,
– hashtags – number of hashtags,
– citations – number of citations (@username),
– capitals – fraction of capital letters,
– exclamations – number of exclamation marks,
– questions – number of question marks,
– emoticonsPos – number of positive emoticons (recognized by the regular expression " [:;]\S*[\)DpP\]\*]"),
– emoticonsNeg – number of negative emoticons (recognized by the regular expression " :\S*[\(/\\\|C]"),
– repeatedLetters – fraction of repeated letters,
– repeatedMarks – fraction of repeated exclamation and question marks,
– numbers – number of numerical expressions (recognized by the regular expression " \d+([\.,]\d+)*"),
– errors – number of spelling errors (obtained using the multi-language LanguageTool),
– yuleK – vocabulary size estimated using Yule's K [16].

To improve the predictions, we also took into account previous research on text-based prediction of sentiment, emotions, etc., by including the following lexical features:

– for all languages: SSPositive/SSNegative – positive/negative sentiment score of the collection of tweets, using the SentiStrength tool [13],
– for English:
  • NRCEmotion_* – numerical values of 10 emotion associations (averaged per word²), using the NRC Word-Emotion Association Lexicon [4],
  • NRCTwitterSentiment – sentiment value, using the NRC Twitter Sentiment Lexicon [2],
  • NRCHashtagSentiment(140) – sentiment value, using the NRC Hashtag Emotion Lexicon and the Sentiment140 lexicon [2],
  • LexiconAFINN – sentiment value, using the AFINN Lexicon [6],
  • MRC_* – features from the MRC database [15]: familiarity, concreteness, imagery, meaningfulness (two measures) and age of acquisition,
  • WWBPLexAge and WWBPLexGender – usage of age- and gender-dependent lexicons from the World Well-Being Project (WWBP) [9],
  • WWBPAll* – correlations with author features (gender, age and personality), using data from WWBP [11],
– for Spanish: SELEmotion_* – numerical values of 6 emotions (joy, anger, fear, disgust, surprise, sadness), using the Spanish Emotion Lexicon [12],
– for Dutch: NLEmotion_* – numerical values of valence, arousal, dominance and age of acquisition, using the lexicon of [5].

In total, we obtained 56 features. Unfortunately, many of them provide information only for English texts.

¹ All subsequent numbers are also averaged per tweet, unless noted otherwise.
² All subsequent values are also averaged per word.

3 Prediction

To predict the traits (gender, age and personality) of Twitter users we apply a simple two-step procedure. The idea is to start by exploring close similarities between writings, and then try to discover more complex dependencies. More specifically, to predict the traits of a new user, we first find the most similar user in the training data. If the similarity is sufficiently close, we assign the traits of the found user to the new user. Otherwise, we use an advanced classification model to predict the traits.
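For the languages other than English, the two-step procedure can be sketched in a few lines. The following is a minimal Python illustration with hypothetical two-feature users and trait labels; the actual implementation described in this paper is in R, and `fallback` here is just a stand-in for the full random forest model of the second step:

```python
import math

def predict_traits(new_user, train_features, train_traits, d_max, full_model):
    """Two-step prediction: if the nearest training user (Euclidean distance
    over all features) is closer than d_max, copy its traits; otherwise fall
    back to the full prediction model."""
    # Step 1: find the nearest neighbour in the training data.
    best_i = min(range(len(train_features)),
                 key=lambda i: math.dist(new_user, train_features[i]))
    if math.dist(new_user, train_features[best_i]) < d_max:
        return train_traits[best_i]          # a look-alike exists: reuse its traits
    # Step 2: no sufficiently similar user, use the full model instead.
    return full_model(new_user)

# Hypothetical users described by 2 features, with (gender, age) traits:
train_X = [[0.1, 0.2], [5.0, 4.8], [9.9, 9.7]]
train_y = [("F", "18-24"), ("M", "25-34"), ("F", "35-49")]
fallback = lambda x: ("?", "?")              # stand-in for the random forest

print(predict_traits([0.0, 0.3], train_X, train_y, d_max=1.0, full_model=fallback))
print(predict_traits([3.0, 1.0], train_X, train_y, d_max=1.0, full_model=fallback))
```

The first query lies within d_max of a training user and inherits that user's traits; the second is too far from everyone and is routed to the fallback model.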
This approach is motivated by the fact that among a large number of tweets one can easily find messages written by the same user. Moreover, it may happen that one person sends tweets from different Twitter accounts. Such multiple Twitter accounts, which allow users to boost their presence on the web, are becoming more and more popular. Finally, a very similar vocabulary can be shared by certain groups of users who also have similar features.

Figure 1. Smoothed histograms of distances between users for English (a) and Spanish (b).

Figure 1 shows normalised smoothed histograms of distances between concordant (having the same traits) and discordant users for English and Spanish. Although the histograms partly overlap, it is clear that the distances within the first group are usually much smaller than those in the second one. The advanced classification model used in the second step allows us to discover more complex dependencies. The details of the whole procedure are given below.

Prediction Algorithm:

1. Finding similar users in the training data. Here, we use two approaches, depending on the language of the tweets.
   – For English we build a classification model in which the identifier of a group of concordant users (having the same traits) is used as the class variable. As the classification model, we use random forests [3], built on all available features. If the maximum of the predicted probabilities for the new user is greater than a certain threshold p_min, we assign the traits of the corresponding group to the new user.
   – For the other languages we simply find the nearest neighbour of the new user in the training data. To determine the nearest neighbour, we use the Euclidean distance and all available features.
If the distance is less than a certain threshold d_max, we assign the traits of the nearest neighbour to the new user.

2. Prediction for dissimilar users. If no similar users are found in the training data, i.e. the predicted probability of the best group is smaller than p_min (for English) or the distance to the nearest neighbour is greater than d_max (for the other languages), we apply the random forest method to predict each trait separately. We use all available features except the word-based ones. For gender and age, decision trees are taken as base learners, whereas for personality traits regression trees are used. Other classification algorithms have also been tested (e.g. logistic regression) but they yielded poorer results.

Observe that the above procedure depends on the choice of the threshold. If p_min is sufficiently small (for English) or d_max sufficiently large (for the other languages), all users from the training data are recognized as similar users and therefore only the first step of the procedure is run. In the opposite case the full prediction model is always employed. To calibrate the threshold, we randomly split the data (30 times) into training and testing parts and then compute the averaged accuracy (gender and age) and mean error, RMSE (personality traits), for different values of the threshold. Figure 2 shows the results for English and Spanish. There is a clear optimum (maximum accuracy or minimum RMSE) for a certain value of the threshold. Note that for English the optimal value is common to all traits and equals p_min ≈ 0.12. For Spanish the optimum is at d_max ≈ 90 for gender and the personality traits, whereas for age it is better to apply the nearest neighbour approach to all users. For the remaining languages we always apply the nearest neighbour method (i.e. set d_max = ∞), as the training sets are too small to build complex models.

4 Results

We have examined how the prediction procedure presented in Section 3 works with the set of features described in Section 2.
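The threshold calibration described in Section 3 amounts to a grid search over repeated random splits. The following is a simplified Python sketch on synthetic data, not the R implementation used in our experiments; the second-step model is replaced by a constant stand-in label:

```python
import math
import random

def two_step_accuracy(X, y, d_max, n_splits=30, seed=0):
    """Mean accuracy of the two-step method over repeated 75/25 random
    train/test splits; beyond d_max a stand-in fallback label is used."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    total = 0.0
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(0.75 * len(idx))
        train, test = idx[:cut], idx[cut:]
        correct = 0
        for j in test:
            # Step 1: nearest training neighbour by Euclidean distance.
            i = min(train, key=lambda k: math.dist(X[j], X[k]))
            if math.dist(X[j], X[i]) < d_max:
                pred = y[i]            # copy the look-alike's trait
            else:
                pred = "fallback"      # stand-in for the full model (step 2)
            correct += (pred == y[j])
        total += correct / len(test)
    return total / n_splits

def calibrate(X, y, thresholds):
    """Pick the threshold with the highest averaged accuracy."""
    return max(thresholds, key=lambda t: two_step_accuracy(X, y, t))
```

On real data one would additionally plot the accuracy against the threshold, as in Figure 2, and use the full random forest prediction in place of the fallback label.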
As measures of performance we use accuracy (gender and age) and RMSE (personality traits). We randomly split the data into training and testing parts, with 75% for training and 25% for testing (for English and Spanish). For Italian and Dutch, due to the small amount of data, we take only one observation for testing and the rest for training. The above procedure is repeated 30 times and the results are averaged over all runs. The classification procedure is implemented in the R system [7] using the libraries randomForest [3], FNN [1] and class [14].

The results of our experiments (for the optimal value of the threshold) are shown in Table 1. Numbers in brackets correspond to a baseline, which is the major class share (for classification) or the mean value (for regression), calculated on the training data.

Figure 2. Accuracy and RMSE with respect to the threshold for gender, age and the components of personality, for English (a) and Spanish (b). Horizontal lines correspond to the baseline.

Table 1. Results of experiments; numbers in brackets correspond to the baseline. Notation: E – Extraversion, S – Stability, A – Agreeableness, C – Conscientiousness, O – Openness.

                        Accuracy                               RMSE
          Gender    Age       Gender&Age   E        S        A        C        O        Mean
English   0.798     0.748     0.659        0.136    0.179    0.139    0.136    0.120    0.143
          (0.432)   (0.353)   (0.215)      (0.164)  (0.227)  (0.158)  (0.152)  (0.149)  (0.170)
Spanish   0.861     0.687     0.671        0.148    0.161    0.114    0.119    0.146    0.141
          (0.437)   (0.451)   (0.247)      (0.177)  (0.209)  (0.161)  (0.192)  (0.171)  (0.182)
Dutch     0.767     -         -            0.133    0.060    0.040    0.060    0.102    0.074
          (0.5)     -         -            (0.158)  (0.117)  (0.126)  (0.099)  (0.142)  (0.128)
Italian   0.900     -         -            0.071    0.043    0.036    0.023    0.029    0.031
          (0.5)     -         -            (0.121)  (0.199)  (0.097)  (0.085)  (0.117)  (0.124)

The third column includes the joint accuracy for gender and age, whereas the last column contains the RMSE averaged over the 5 personality traits. First, note that all the results exceed the baseline. Gender and age identification are clearly successful: we obtain an accuracy of 77%-90% for gender and 69%-75% for age. Moreover, simultaneous prediction of these two traits is also possible: the accuracy is about 3 times larger than the baseline. Personality assessment is a much more challenging task; our experiments indicate that it is difficult to obtain an error significantly below the baseline.

Finally, we assess the predictive power of the features using the variable importance measure based on random forests. The measure pertains to the average decrease of node impurity (the Gini impurity index for classification and the residual sum of squares for regression). The average is taken over all splitting nodes and over all trees used to construct the ensemble classifier. It shows the usefulness of a given feature for prediction when a random forest is used as the prediction tool. Figure 3 shows the top 20 features for the prediction of selected traits for English. The plot clearly shows that features based on the word lists collected by the World Well-Being Project (WWBP) are among the most useful for prediction. Moreover, it is interesting that simple style-based features, like message length or the numbers of exclamation marks and citations, seem to be relevant for age identification.
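The per-split quantity behind this importance measure can be made concrete: a random forest scores a feature by how much each split on it reduces node impurity, then averages over all splits and trees. A minimal Python illustration of the classification case (Gini decrease for one split on one feature; the regression analogue uses the residual sum of squares):

```python
def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(values, labels, split):
    """Impurity decrease from splitting a node at `split` on one feature;
    averaging this over all splits and trees gives the importance measure."""
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    if not left or not right:
        return 0.0                    # degenerate split: no impurity decrease
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children

# A feature that separates the two classes perfectly recovers the full
# parent impurity of 0.5:
print(gini_decrease([1.2, 0.8, 6.1, 5.9], ["F", "F", "M", "M"], split=3.0))  # 0.5
```

A feature whose values are interleaved across classes yields a decrease near zero, which is why uninformative features end up at the bottom of the rankings in Figure 3.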
5 Conclusions

In this study we presented a two-stage procedure for author profiling, which first exploits language similarities between users and then aims to discover more complex dependencies. The method is motivated by the fact that authors using very similar language tend to have identical traits. Interestingly, it turns out that the combination of these two steps usually outperforms using either step separately. Our approach is based on both word-based and text-based features. While we obtain successful results for gender and age prediction, personality identification seems to be much more challenging: the error is only slightly below the baseline. The assessment based on random forests shows the high relevance of features built on lexica from previous works.

Figure 3. Feature importance measures based on random forests, for English (top 20 features for gender, age, stability and openness, ranked by the mean decrease of the Gini index for classification and of the residual error for regression).

The results of the experiments suggest many possibilities for future work. In our method, separate classification models are built for each trait; it would be worthwhile to explore dependencies between the traits to improve the prediction performance. Secondly, in order to significantly improve personality identification, it seems necessary to look for new features. Finally, we believe that the advantages of our two-stage procedure would be seen more clearly on a larger corpus of tweets.

Acknowledgements

This study was supported by a research fellowship within the "Information technologies: research and their interdisciplinary applications" project, agreement number POKL.04.01.01-00-051/10-00.

References

1. Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., Li, S.: FNN: Fast Nearest Neighbor Search Algorithms and Applications (manual) (2013)
2. Kiritchenko, S., Zhu, X., Mohammad, S.M.: Sentiment Analysis of Short Informal Texts. Journal of Artificial Intelligence Research 50, 723–762 (2014)
3. Liaw, A., Wiener, M.: Classification and Regression by randomForest. R News 2, 18–22 (2002)
4. Mohammad, S.M., Turney, P.D.: Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence 29(3), 436–465 (2013)
5. Moors, A., De Houwer, J., Hermans, D., Wanmaker, S., van Schie, K., Van Harmelen, A.L., De Schryver, M., De Winne, J., Brysbaert, M.: Norms of valence, arousal, dominance, and age of acquisition for 4,300 Dutch words. Behavior Research Methods 45(1), 169–177 (2013)
6. Nielsen, F.Å.: A new ANEW: evaluation of a word list for sentiment analysis in microblogs. In: Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages. vol. 718, pp. 93–98. CEUR-WS.org (2011)
7.
R Core Team: R: A Language and Environment for Statistical Computing. Tech. rep., R Foundation for Statistical Computing (2013)
8. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd Author Profiling Task at PAN 2015. In: Cappellato, L., Ferro, N., Gareth, J., San Juan, E. (eds.) CLEF 2015 Labs and Workshops, Notebook Papers. CEUR-WS.org (2015)
9. Sap, M., Park, G., Eichstaedt, J.C., Kern, M.L., Stillwell, D.J., Kosinski, M., Ungar, L.H., Schwartz, H.A.: Developing Age and Gender Predictive Lexica over Social Media. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1146–1151. Association for Computational Linguistics (2014)
10. Schmid, H.: Improvements In Part-of-Speech Tagging With an Application To German. In: Proceedings of the ACL SIGDAT-Workshop. pp. 47–50. Association for Computational Linguistics (1995)
11. Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E.P., Ungar, L.H.: Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLOS ONE 8(9) (2013)
12. Sidorov, G., Miranda-Jiménez, S., Viveros-Jiménez, F., Gelbukh, A., Castro-Sánchez, N., Velásquez, F., Díaz-Rangel, I., Suárez-Guerra, S., Treviño, A., Gordon, J.: Empirical study of machine learning based approach for opinion mining in tweets. In: Proceedings of the 11th Mexican International Conference on Advances in Artificial Intelligence (MICAI'12). Lecture Notes in Computer Science, Springer-Verlag (2013)
13. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D.: Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science 61(12), 2544–2558 (2010)
14. Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S. Springer-Verlag (2002)
15. Wilson, M.: MRC psycholinguistic database: Machine-usable dictionary, version 2.00. Behavior Research Methods, Instruments, & Computers 20(1), 6–10 (1988)
16. Yule, G.U.: The Statistical Study of Literary Vocabulary. Cambridge University Press (1944)