Predicting an author’s demographics from text using Topic Modeling approach Notebook for PAN at CLEF 2015 Hafiz Rizwan Iqbal, Muhammad Adnan Ashraf, Rao Muhammad Adeel Nawab COMSATS Institute of Information Technology, Lahore rizwan.iqbal@ciitlahore.edu.pk,adnan.ashraf@ciitlahore.edu.pk, adeelnawab@ciitlahore.edu.pk Abstract. The paper presents an approach to predict personality traits of a writ- er for the author profiling task of the PAN CLEF 2015. The task aimed at pre- dicting authors’ demographics based on the written tweets of an author. These demographics included traditional authorship attributes of age, gender and vari- ous personality traits of an author. We applied topic modeling using LDA as baseline approach and used the generated topic to get hierarchical probabilities of the topics. J48 decision tree was used for training classification model. The trained models were then used to successfully predict the demographics of training and test datasets 1 Introduction Identifying various demographic traits such as, age, gender, native language and other personality aspects, from the authors writing style is termed as Author profiling [2]. Due to its high implication in the computer forensics, marketing and content recom- mendations over the internet, it has become a hot research area in Natural Language Processing. Twitter has been the field of quantitative study on a number of aspects and charac- teristics recently. The primary interest of researchers has been to process the user tweets to interpret users’ interests and to correlate social and global happenings [1] whereas this research focus on predicting the author profiling attributes. Twitter da- taset has been used in this research for author profiling. PAN 15 is the competition held as a part of CLEF Conference. The PAN 15’ com- petition is designed for three different tasks namely, Plagiarism Detection, Author Verification and Author Profiling. Each task required to develop a composite software and submission on the TIRA, an evaluation engine. The PAN 15’ Author profiling task was designed to evaluate seven demographic constraints of the author from his/her tweets. These demographics include identifying authors’ age, gender and five personality traits which include extroverted, stable, agreeable, conscientious and open. The training corpus was provided by PAN in four different languages, English, Spanish, Italian and Dutch. The target was to achieve the highest ranking rating, which included ratio for accurately identifying the authors age and gender and the average Root Mean Squared Error for the personality constraints. To predict a given author’s attributes, we generated LDA based topic models using mallet and used J48 decision tree in Weka for training and evaluation of our model. LDA identifies latent topic associations in multi-document collection where each topic is assigned a probability with respect to all other topics in a document and also each topic is assigned a probability with respect to number of words [1]. Topic model- ing using standard LDA has gained attention recently and work has been conducted in community detection using LDA [11] and author profiling. Topic modeling using LDA has also provided encouraging results in microblogging and its application [12].MALLET [9], a famous topic modeling and inferring toolkit, uses LDA to build the topic models for given text. This paper focuses on the English tweets of the PAN 15’ provided dataset for both training and testing phases [7]. The detail of the methodology is explained in the Sec- tion 2 while results of training phase and testing phase are discussed in the Section 3 and Section 4, respectively. Section 5 provides conclusion and future work. 2 Proposed Approach We used topic modeling [3] as the baseline approach to predict an author’s profile on the basis of his/her tweets. Why topic modeling as baseline approach? It has been analyzed that different categories of people have different topics of interests [6] e.g. women mostly talk about fashion, dresses and cooking etc. whereas men like to dis- cuss politics, cricket and technology etc. This natural phenomenon leads us to predict a person’s age, gender and other personality traits on the basis of his/her written text. There are the three stages in our proposed approach (1) Dataset Pre-processing, (2) Fabrication of Topic and Classification Models (3) Prediction of author traits. 2.1 Pre-processing: The English Language training dataset provided by PAN 15’ was selected for the author profiling task. The training dataset consisted of 152 users’ tweets. Each user’s data was placed separately in an xml file. The classifications of all xml files were placed in a single text file. During pre-processing phase only tweets were extracted from each xml file and were stored in a separate text file for each user. There was no further pre-processing performed on the dataset, such as stop word removal, stemming, removal of punctua- tion marks, lemmatization, as the topic model disregards it and to retain the author’s original style based features [4]. 2.2 Fabrication: Topic and Classification Models The provided dataset consisted of three main demographic traits of users, i.e. gender, age and personality constraints. Age and gender had accuracy values in classification whereas the five personality constraints had root mean error as the classification val- ues. A directory structure was created with subdirectories for two demographics (age and gender) and five personality traits (extroverted, stable, agreeable, conscientious and open). Table 1 enlists the classification details of the dataset provided in PAN 15’. The text files extracted in pre-processing stage were placed in their classification based subdirectory structure. The dataset contained equally distributed profiles for the male and female authors. By analyzing the dataset, it was found that the majority of the profiles’ authors were from the first two age groups (i.e. 18-24 and 25-30) where- as the profiles from age group 34-50 and 50+ were relatively lower. Each personality identifier was further classified based on provided root mean square error value rang- ing between -0.5 and 0.5 [7]. Each subdirectory was imported into MALLET, ran the topic modeling routine with setting of 20 topics for each subdirectory and inference file. As an output of this routine, list of extracted topics, topic composition file (file which contains the proba- bility of participation of other topics into a single topic), trained topic model and topic inference file [9] was generated in a sequential order with respect to each trait directo- ry. ARFF (Attribute Relation File Format) [10] file was created from the topic compo- sition file. Each topic was considered as one attribute and its probability taken as val- ue of that attribute. Classification attribute was created for each arff file with respect to each personality trait. Each author arff file was sourced to WEKA and J48 tree classifier algorithm [10] was applied for construction of classification model for the respective personality attribute. Table 1. -Classification of English dataset Gender Male Female 76 76 Age 18-24 25-34 34-49 50+ 58 60 25 12 Extroverted -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 1 4 10 17 41 37 20 13 9 Stable -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 11 5 22 9 19 37 19 18 12 Agreeable -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 5 2 12 19 44 46 13 7 4 Conscientious -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0 1 4 30 38 27 33 12 7 Open -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0 0 2 1 47 39 12 19 21 2.3 Prediction of Author Traits To predict files in test data set, first two steps of the proposed approach with little variation in step 2, were applied on each test file to get the topics list, topic composi- tion file and finally arff file. The test file was then compared with trained classifica- tion model to predict each personality trait value. The predicated results were then output in an xml file as per the task requirement. 3 Results for Training Phase The final submission consisted of java based composite software which required an input directory consisting of xml files and an output directory to place the resultant xml files. The submitted software was first run on training dataset. Table 2 shows the results obtained on the PAN 15’ training dataset with accuracy as evaluation measure for age and gender attributes whereas the personality traits’ results based on Root Mean Square Error [RMSE] are presented in Table 3. The results show that our soft- ware was able to predict 54% correct classification for the age and 81.5% for the Gender whereas 44.7% correct predictions were made for both correct age and gender for the users. Similarly the results on personality traits are also encouraging. Table 2. - Results on Age and Gender Age Gender Both 0.540 0.815 0.447 Table 3. - Results on Personality Traits Extroverted Stable Agreeable conscientious Open RMSE Global 0.150 0.200 0.154 0.149 0.100 0.151 0.648 4 Results for Testing Phase The trained models were then run on the English test dataset 2 provided by PAN 15’. The evaluated test results are manipulated in the Table 4 and Table 5. The Test results on age and gender were different from the training dataset results. We were able to predict the age more accurately (69.7%) than the age on training dataset (54%) but gender prediction was poor (55.6 %) with respect to the gender on the training dataset (81.5%). Similarly the results of the personality traits on the test dataset were also encouraging with respect to the training dataset. Table 4. - Test Results on Age and Gender Age Gender Both 0.697 0.556 0.394 Table 5. - Test Results on Personality Traits Extroverted Stable Agreeable conscientious open RMSE Global 0.208 0.315 0.191 0.190 0.214 0.224 0.585 5 Conclusion and Future Work Author profiling requires an efficient and effective system for analyzing data for secu- rity and commercial purposes. In our approach, we developed a java based software that implied LDA for topic model and J48 classification algorithm to predict writers’ demographics from the twitter dataset provided by PAN 15’. The results obtained are very encouraging especially the accuracy measures. Future efforts can be focused on applying the different variations of topic modeling algorithm such as hierarchical LDA and implying supervised classification models to predict the demographic traits more accurately and precisely. The code will be opti- mized and effort can be put to minimize the total runtime of the software. 6 References 1. Liangjie Hong and Brian D. Davison.: Empirical Study of Topic Modeling in Twitter, 1st Workshop on Social Media Analytics (SOMA ’10), Washington, DC, USA (2010) 2. M. Suraj, S. Prasha and S. Thamar.: A Simple Approach to Author Profiling in MapRe- duce, Notebook for PAN, CLEF (2014) 3. Blei, David M.; Ng, Andrew Y.: Jordan, Michael I.: Latent Dirichlet allocation. In Laffer- ty, John. Journal of Machine Learning Research 3 (4–5): pp. 993–1022. (2003) 4. Pavan A., Mogadala A., Varma V.: Author profiling using LDA and Maximum Entropy, Notebook for PAN at CLEF (2013) 5. Caruana, R. and Niculescu-Mizil, A.: An Empirical Comparison of Supervised Learning Algorithms”. In Proceedings of the International Conference on Machine Learning. Pitts- burgh, Pennsylvania, pp. 161–168 (2006) 6. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at pan 2015. In: Cappellato L., Ferro N., Gareth J. and San Juan E. (Eds). (Eds.) CLEF 2015 Labs and Workshops, Notebook Papers. CEUR-WS.org, (2015) 7. K Santosh, Romil Bansal, Mihir Shekhar, and Vasudeva Varma.: Author Profiling: Pre- dicting Age and Gender from Blogs, Notebook for PAN at CLEF (2013) 8. D. Ramage, S. Dumais, and D. Liebling.: Characterizing microblogs with topic models. In International AAAI Conference on Weblogs and Social Media, (2010) 9. McCallum, Andrew Kachites.: MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu (2002) 10. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten.: The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1. (2009) 11. H. Zhang, C. L. Giles, H. C. Foley, and J. Yen. Probabilistic community discovery using hierarchical latent Gaussian mixture model. In AAAI’07: Proceedings of the 22nd Nation- al Conference on Artificial Intelligence, pages 663–668, (2007).