Author’s Traits Prediction on Twitter Data using a Content Based Approach

Notebook for PAN at CLEF 2015

Fahad Najib, Waqas Arshad Cheema, Rao Muhammad Adeel Nawab
Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan.
choudharyfahad@gmail.com, waqascheema06@gmail.com, adeelnawab@ciitlahore.edu.pk

Abstract. This paper describes the methods we employed to solve the author profiling task at PAN 2015. The proposed system is based on simple content based features to identify an author’s age, gender and personality traits. The problem of author profiling was treated as a supervised machine learning task: content based features were first extracted from the text, and different machine learning algorithms were then applied to train the models. The results show that a content based approach can be very useful in predicting an author’s traits from his/her text.

1 Introduction

Authorship attribution concerns the classification of documents into predefined classes based on the writing style of their authors. In the author verification and author identification tasks, the style of individual authors is examined, whereas author profiling aims to distinguish between classes of authors by studying their sociolect aspect, that is, how language is shared among groups of people. This helps in predicting profiling aspects such as age, gender or personality type. Author profiling is a problem of increasing importance in several applications such as forensics, security and marketing. For example, from a forensic linguistics perspective, the linguistic profile of the sender of a harassing SMS message can be identified. Similarly, from a marketing perspective, companies would like to know the demographics of the people who like or dislike their products, based on the analysis of online product reviews and blogs.

In recent years, the automatic detection of an author’s profile from his/her text has become an emerging and popular research area (Rangel et al., 2013). Automatically predicting the identity of authors from their texts has many applications, e.g., forensic analysis (Corney et al., 2002; Abbasi and Chen, 2005), marketing intelligence (Glance et al., 2005), and classification and sentiment analysis (Oberlander and Nowson, 2006).

2 Related work

A significant amount of research on the automatic classification of texts into predefined classes has already been carried out by researchers and linguists using several different machine learning techniques (Sebastiani, 2002). Over the past few years, a large variety of techniques have been devised for predicting an author’s traits from his/her text (Abbasi and Chen, 2005; Houvardas and Stamatatos, 2006; Schler et al., 2006; Argamon et al., 2009; Estival et al., 2008; Koppel et al., 2009). Previous work has tried a variety of machine learning classifiers, including lazy learners (IBk) (Estival et al., 2007, 2008), Support Vector Machines (SVM) (Koppel et al., 2009; Estival et al., 2007), LibSVM (Estival et al., 2008), Random Forest (Estival et al., 2008), Information Gain (Houvardas and Stamatatos, 2006), Bayesian regression (Koppel et al., 2009), and Exponential Gradient (Koppel et al., 2002).

Several approaches have been implemented, and experiments conducted, to select the best possible feature set for the most accurate classification. Houvardas and Stamatatos (2006) showed the usefulness of n-grams, whereas Koppel et al. (2009) showed the effect of gender and age in blogging sites by considering different word classes and relating them to the author’s age and gender. Koppel et al. (2009) and Estival et al. (2007) identified part-of-speech as a commendable linguistic feature, and Calix et al. (2008) achieved an accuracy of 76.72% using 55 different features.

3 Experimental setup

The data used in our experiments is the training dataset of PAN 2015 (http://pan.webis.de/). The corpus consists of tweets on different topics, grouped by author and labeled with the author’s language, gender, age group and five personality traits: extroverted (Ex), stable (St), agreeable (Ag), conscientious (Co) and open (Op). The documents are categorized into four languages (English, Dutch, Italian and Spanish), two genders (male and female), and four age groups (18-24, 25-34, 35-49 and 50-XX). For each personality trait, the score lies between -0.5 and 0.5. Each document in the corpus consists of a collection of posts made by a single user.

The corpus is balanced gender-wise within each age group, but imbalanced in terms of age representation and the distribution of the five personality trait scores (-0.5 to 0.5). The proportions of the languages, genders and age groups within the training dataset are presented in Table 1, and the personality trait distribution in Table 2.

            Language                        Gender            Age group
English   Dutch   Italian   Spanish     Male   Female    18-24   25-34   35-49   50-XX
  152      34       38        100        162     162       80     106      44      94

Table 1. Distribution of data by language, gender and age group.

Class           -5   -4   -3   -2   -1    0    1    2    3    4    5
Agreeable        0    0    5    7   30   33   81  105   34   13   16
Conscientious    0    0    0    3    6   58   78   59   55   41   24
Extroverted      0    0    4    4   15   33   87   89   36   31   25
Open             0    0    0    0    9   12  102   74   38   52   37
Stable           0    0   13   17   56   24   42   64   47   41   20

Table 2. Distribution of data across personality trait score bins (trait scores range from -0.5 to 0.5).

Prior to any model training or testing, we applied some preprocessing steps to all documents. We eliminated all content that was not determined to be text written by the user, such as XML tags, since our primary source of features is the text written by the author. As all the user posts lie within the unparsed data (CDATA) tags of the source XML file, we disregarded any text outside these tags, as well as any HTML tags.
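The paper gives no implementation details for this preprocessing step, so the following is a minimal sketch, assuming the PAN 2015 per-author XML layout in which each tweet is stored in a <document> element as a CDATA section; the tag name and the helper function are our assumptions, not the authors’ code.

    import re
    from xml.etree import ElementTree

    def extract_tweets(xml_path):
        """Keep only text the author actually wrote: pull the tweet
        bodies out of the XML file and drop leftover HTML markup."""
        tweets = []
        # Assumed layout: one <document> element per tweet (PAN 2015
        # style); the parser resolves CDATA sections to plain text.
        for doc in ElementTree.parse(xml_path).iter("document"):
            text = re.sub(r"<[^>]+>", " ", doc.text or "")  # strip HTML tags
            text = re.sub(r"\s+", " ", text).strip()        # normalise spaces
            if text:
                tweets.append(text)
        return tweets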
4 Feature selection

As males and females like to write about different topics, they use different words accordingly. This leads to the fact that content based features can be an important tool to distinguish between the texts of males and females (Schler et al., 2006). For example, a tweet about sports is more likely to have been written by a male author than by a female one. Such a tweet may contain words like goal, score or world cup, so the occurrence of words like these increases the chance that it was written by a male author. Similarly, the occurrence of words or phrases like my husband, shopping or nail polish increases the chance that it was written by a female author. In a similar fashion, teenagers like to write more about their school life and friends, whereas people in their 20s write more about their college life, and people in their 30s write more about jobs, marriage and politics. So content based features can be an important tool to distinguish between texts written by people belonging to different profiles.

We calculated the frequencies of different unigrams in the texts written by each profile. Then, for every unigram, we calculated the ratio of its frequencies across the different classes, i.e., male and female, the different age groups and the different personality traits. Finally, we selected the features on the basis of two combined factors: first, the unigrams with the highest frequencies in the corpus, and second, the difference of their frequencies between the classes that are to be distinguished from one another. The frequencies of some of the most frequently used unigrams of the English part of the corpus, compared across gender and age groups, are given in Table 3. The same routine was carried out for each of the four languages individually, yielding a set of content based features per language. For each language, two different sets of features were then used: one for gender and age group prediction, and one for personality trait classification.

             Gender              Age group
Word       Male   Female    18-24   25-34   35-49   50-XX
fuck        .22     .14      .29     .07      0       0
love        .16     .37      .28     .17     .05     .03
peopl       .19     .14      .21     .08     .04      0
feel        .13     .07      .14     .06      0       0
data        .13     .01       0      .02     .01     .02
life        .13     .13      .16     .04     .06      0
time        .12     .19      .02     .01      0       1
job         .09     .01      .01     .09      0       0
girl        .09     .02      .1      .01      0       0
day         .21     .14      .2      .09     .03     .03

Table 3. Frequency comparison of English unigrams across gender and age groups.
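As an illustration of this selection routine, the sketch below ranks unigrams for one pair of classes (male vs. female) by combining the two factors named above; the tokenisation, the exact scoring formula and the cut-off k are our assumptions, since the paper does not specify them.

    from collections import Counter

    def select_unigrams(male_texts, female_texts, k=100):
        """Rank unigrams that are both frequent overall and skewed
        towards one class, mirroring the two combined factors."""
        male = Counter(w for t in male_texts for w in t.lower().split())
        female = Counter(w for t in female_texts for w in t.lower().split())
        m_total = sum(male.values()) or 1
        f_total = sum(female.values()) or 1

        def score(word):
            m = male[word] / m_total      # relative frequency, male class
            f = female[word] / f_total    # relative frequency, female class
            # Overall frequency weighted by the between-class gap
            # (assumed combination of the paper's two factors).
            return (m + f) * abs(m - f)

        vocabulary = set(male) | set(female)
        return sorted(vocabulary, key=score, reverse=True)[:k]

The same routine can be repeated per age group and per personality trait to obtain the class-specific feature sets described above.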
5 Models and Evaluation

For each of the four languages, we trained two models: one for age and gender, and one for the personality traits. In total, eight models were thus built, all on content based features. We ran the experiments with four machine learning classifiers: J48, Random Forest, Support Vector Machines (SMO), and Naive Bayes. The evaluation measures are those prescribed by PAN 2015 (http://pan.webis.de/): accuracy for age and gender, and root mean squared error (RMSE) for the five personality traits.
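The paper does not name its toolkit, although the classifier names (J48, SMO) suggest Weka. As a rough stand-in, the sketch below shows the per-language setup in scikit-learn: a linear SVM classifier for gender scored by accuracy, and, as one plausible reading of the personality setup, a support vector regressor for a trait score evaluated by RMSE; the vectoriser settings and the regression reading are our assumptions.

    import math
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score, mean_squared_error
    from sklearn.svm import SVC, SVR

    def train_and_evaluate(texts, genders, trait_scores, vocabulary):
        """One language: an SVM for gender (accuracy) and an SVM
        regressor for one trait score in [-0.5, 0.5] (RMSE)."""
        # Unigram counts restricted to the selected content based features.
        X = CountVectorizer(vocabulary=vocabulary).fit_transform(texts)

        clf = SVC(kernel="linear").fit(X, genders)
        accuracy = accuracy_score(genders, clf.predict(X))

        reg = SVR(kernel="linear").fit(X, trait_scores)
        rmse = math.sqrt(mean_squared_error(trait_scores, reg.predict(X)))
        return accuracy, rmse

Evaluating on the training data itself, as done here, corresponds to the figures reported in Table 4; the PAN evaluation instead scores the submitted software on held-out test data (Table 5).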
6 Results and Analysis

Based on the performance of the four classifiers (see Section 5) on the training data, we chose a single classifier (SVM) for all the classes of age, gender and personality in the final software submitted to the competition. The results achieved on the training data are reported in Table 4, and those on the testing data in Table 5. Both tables also report the combined accuracy for age and gender (Both) and the combined root mean squared error over all five personality traits (RMSE). The age group results for Italian and Dutch are missing in both the training and the testing data, as there is only one age class for these languages.

Language   Gender   Age     Both    Ex      St      Ag      Co      Op      RMSE    Global
English    0.914    0.967   0.894   0.076   0.093   0.088   0.087   0.077   0.084   0.905
Italian    1.000    NA      NA      0.028   0.051   0.048   0.032   0.038   0.039   0.980
Spanish    0.940    0.990   0.930   0.085   0.110   0.084   0.102   0.083   0.093   0.918
Dutch      1.000    NA      NA      0.000   0.000   0.000   0.000   0.000   0.000   1.000

Table 4. Results on training data.

Language   Gender   Age     Both    Ex      St      Ag      Co      Op      RMSE    Global
English    0.591    0.669   0.422   0.187   0.261   0.176   0.161   0.195   0.196   0.613
Italian    0.527    NA      NA      0.160   0.220   0.157   0.136   0.190   0.173   0.667
Spanish    0.840    0.568   0.454   0.159   0.247   0.188   0.152   0.171   0.183   0.635
Dutch      0.468    NA      NA      0.136   0.176   0.091   0.123   0.091   0.124   0.672

Table 5. Results on testing data.

Comparing performance across languages, our system performed better for English and Dutch than for Spanish and Italian. Trait-wise, the performance is reasonably good for Spanish gender, English age and Dutch personality. Overall, our system could not perform as well on the testing data as it did on the training data. Possible reasons for this are the variation of language between the training and testing data and the nature of the content based approach.

7 Conclusion

In this paper, we have presented a content based technique for the automatic classification of an author’s gender, age and personality from his/her writing. This work has a number of potential applications, such as marketing, forensics and security. We performed our experiments on the training data provided by the PAN 2015 organizers. We applied some simple content based techniques, and the results we achieved are highly encouraging, showing the usefulness of content based features in predicting an author’s profile from text. In future work, the results could be further improved by finding and incorporating more suitable features.

Bibliography

1. Ahmed Abbasi and Hsinchun Chen. Applying authorship analysis to Arabic web content. In Intelligence and Security Informatics, pages 183–197. Springer, 2005.
2. Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2):119–123, 2009.
3. K. Calix, M. Connors, D. Levy, H. Manzar, G. McCabe, and S. Westcott. Stylometry for e-mail author identification and authentication. Proceedings of CSIS Research Day, Pace University, 2008.
4. Malcolm Corney, Olivier de Vel, Alison Anderson, and George Mohay. Gender-preferential text mining of e-mail discourse. In Proceedings of the 18th Annual Computer Security Applications Conference, pages 282–289. IEEE, 2002.
5. Dominique Estival, Tanja Gaustad, Son Bao Pham, Will Radford, and Ben Hutchinson. Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING'07), pages 263–272. PACLING, 2007.
6. Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and Will Radford. Author profiling for English and Arabic emails. 2008.
7. Natalie Glance, Matthew Hurst, Kamal Nigam, Matthew Siegler, Robert Stockton, and Takashi Tomokiyo. Deriving marketing intelligence from online discussion. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 419–428. Association for Computing Machinery, 2005.
8. John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identification. In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer, 2006.
9. Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412, 2002.
10. Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9–26, 2009.
11. Jon Oberlander and Scott Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 627–634. Association for Computational Linguistics, 2006.
12. Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. Overview of the author profiling task at PAN 2013. Notebook Papers of CLEF, pages 23–26, 2013.
13. Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker.
Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6, pages 199–205, 2006.
14. Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.