Author profiling using LDA and Maximum Entropy
Notebook for PAN at CLEF 2013

Aditya Pavan, Aditya Mogadala, Vasudeva Varma
Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad
aditya.pavanm@students.iiit.ac.in, aditya.m@research.iiit.ac.in, vv@iiit.ac.in

Abstract. This paper describes our submission to the author profiling task of the PAN/CLEF 2013 workshop. To classify documents by the gender and age of their author, we applied a traditional topic modeling approach using Latent Dirichlet Allocation (LDA). We used content-based features such as topics and style-based features such as preposition frequencies, which act as efficient markers for separating authors by age and gender. We ran tenfold cross-validation and observed that our classification approach using Maxent and LDA gave an accuracy of 53.3% for English and 52% for Spanish.

1 Introduction

Authorship attribution, or author profiling, has been a standard problem in Information Retrieval, Statistical Natural Language Processing and Machine Learning. As the number of user blog posts and micro-blogs on the internet grows, author profiling serves as a pre-processing step that supports several areas of text processing, such as opinion mining, mood mining and polarity extraction. Every user comment or blog post is directly or indirectly associated with attributes of its author, such as age, gender and other demographic features. Extracting these attributes from a given document is therefore a priority. As part of the PAN competition, we applied a traditional approach to extract features from a document and predict the gender and age of its author.
We treated the topics that authors write about as standard features and built a topic model from the corpus using an unsupervised learning technique, Latent Dirichlet Allocation (LDA) [4]. From the generated topic model, we trained a discriminative model using Maxent classification to profile documents by the gender and age of the author. The same discriminative model was used for inference on the tenfold validation data sets.

The paper is organized as follows. Section 2 briefly explains the features we adopted to derive authorship attributes such as age and gender. Section 3 explains our approach. Section 4 concludes our work.

2 Features

2.1 Explaining the features

Based on the variations in how authors express themselves, features used for author profiling can be categorized into two types: content-based features and style-based features [1]. Earlier work used markers such as textual style, vocabulary complexity, orthographic errors and morphological mapping to capture authorship attributes. The preponderance of evidence, however, suggests that a wide variety of these features are captured by simple markers like function words [2] and individual parts of speech. In this paper, we instead focus on extracting the age and gender of an author from the topics used in the document and the distribution of those topics within the corpus. Since an author's writing characteristics depend on age [2] and gender [2,3], which in turn are reflected in the topics used in an article, our work focuses on building a topic model that naturally subsumes simple markers like noun phrases among parts of speech as well as more complex markers. In addition to content-based features like topics, we also considered style-based features such as the frequency of prepositions used by the author and the number of superlative adjectives used within a document.
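As an illustration of the style-based feature described above, the following minimal sketch computes the relative frequency of each preposition in a document. The preposition list, the tokenizer and the function name `preposition_frequencies` are assumptions made for this example; the paper does not state which list or tokenization was used, and superlative adjectives are not covered here.

```python
import re
from collections import Counter

# Hypothetical, deliberately small English preposition list (an assumption;
# the list actually used in the experiments is not given in the paper).
PREPOSITIONS = {"in", "on", "at", "of", "to", "with", "by", "for", "from", "about"}


def preposition_frequencies(text):
    """Return the relative frequency of each listed preposition in `text`."""
    # Simple lowercase word tokenizer; an assumption for this sketch.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty documents
    counts = Counter(t for t in tokens if t in PREPOSITIONS)
    # One feature per preposition, normalized by document length.
    return {"prep_" + p: counts[p] / total for p in sorted(PREPOSITIONS)}


features = preposition_frequencies("The cat sat on the mat in the house")
```

Normalizing by document length keeps the feature comparable between short comments and long blog posts.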
2.2 Features for Age and gender

As mentioned earlier, topics play a significant role in predicting the age of an author. In our present work, we observed that the usage of topics varies from one age group to another. The corpus used in this task provides substantial evidence that articles by users in the 10s age group (13-17) comprise topics related to adolescence, school activities and adolescent crushes, while users in the 20s age group (23-27) write about college life, favorite heroines and heroes, premarital affairs, etc. Users in the 30s age group (33-47) post more about corporate and social activities, post-marriage life, etc. [2]. Similarly, male authors stress topics related to sports, politics and technology, whereas female authors post on topics like beauty, shopping, kitty parties, etc. [3].

However, we observed from the data that although the topic set used by an author helps in demarcating the age groups, there are considerable overlaps in topics among the age groups and genders. To resolve these overlaps, we considered a topic distribution model rather than just a set of topics. We used a generative model, Latent Dirichlet Allocation (LDA) [4], to obtain a probability distribution over topics for each document. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. Generating models using LDA was thus an essential step in feature extraction in our experiment.

3 Approach

3.1 Processing of Corpus

We used the corpus available on the PAN website. Since the data was in markup form, we generated clean data by parsing out the tags and eliminating unnecessary duplicates.
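The cleaning step can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular-expression tag stripping, the exact-match duplicate check and the names `clean_text` and `clean_corpus` are ours, and the actual parser used on the PAN data is not described in the paper.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # any markup tag
WS_RE = re.compile(r"\s+")        # runs of whitespace


def clean_text(raw):
    """Strip markup tags, decode character entities, collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    return WS_RE.sub(" ", html.unescape(no_tags)).strip()


def clean_corpus(raw_docs):
    """Clean every document and drop empty or exactly duplicated texts."""
    seen = set()
    cleaned = []
    for raw in raw_docs:
        text = clean_text(raw)
        if text and text not in seen:  # assumption: duplicates match exactly
            seen.add(text)
            cleaned.append(text)
    return cleaned


docs = clean_corpus(["<p>Hello <b>world</b></p>", "Hello   world", "<br/>"])
```

Near-duplicate detection (for example, posts that differ only in a signature line) would need a fuzzier comparison than the exact match used here.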
To separate training and test data, we created tenfold cross-validation sets, and within the training sets we generated data sets for the individual age groups and genders. Our working model is independent of the language, so we employed the same approach for both the Spanish and English data sets.

3.2 Calculating frequencies

Prior work [2, 3] implies that male authors tend to use more prepositions in articles or blog posts than female authors. As part of our style-based features, we computed the frequencies of prepositions in each document and generated the tf-score. We did not consider anomalies and dialectal exceptions, as they can lead to overfitting of the model. We therefore used this generalized observation to demarcate the gender-based authorship attributes.

3.3 Generating topic models

To implement topic modeling, we used a Java-based package named Mallet [5]. Since the topic distribution disregards function words and stop words, we removed them from our individual data sets. We also skipped preprocessing steps like stemming and lemmatization in order to retain the style-based features of the authors. For example, an author posting an article on cricket would use the term ‘bowling’ in the context of the game. If we ran lemmatization or stemming on this word, the result would be ‘bowl’, which can refer to either kitchenware or cricket. Though LDA takes care of these differences, we skipped these steps in order to retain the author style and subsume the noise in the corpus. The gender-specific and age-specific data sets were subjected to topic modeling, and we generated five corresponding topic models. Each topic model was built with 250 topics and 1000 iterations.
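The preprocessing choice described in Section 3.3 (drop stop words, but leave surface forms untouched) can be sketched as below. This is not Mallet's own pipeline; the tiny stop-word list, the lowercasing and the name `tokens_for_topic_model` are assumptions made purely for illustration.

```python
import re

# Hypothetical miniature stop-word list; a real run would use a full list
# for the language at hand (an assumption, not taken from the paper).
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "in", "on", "and", "to", "his", "her"}


def tokens_for_topic_model(text, stop_words=STOP_WORDS):
    """Tokenize and remove stop words without stemming or lemmatizing."""
    # Lowercasing is an assumption of this sketch.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words keep their original inflection, so 'bowling' is not reduced to 'bowl'.
    return [t for t in tokens if t not in stop_words]


tokens = tokens_for_topic_model("His bowling was the highlight of the match")
```

The filtered token lists would then be the input to the topic modeling step.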
3.4 Classification using Maxent

Earlier, several teams used linear classifiers like Winnow, which overcomes differences between genres and dependencies between features, or generative models like Naïve Bayes, which considers a bag of words, for author profiling. We chose a discriminative model, Maxent, as it suffices for our goal of classifying documents by gender as well as age group. Since the input to the classification task is the distribution of topics, maximum entropy was used in order to improve the maximum likelihood during estimation. The model reduces overfitting, as it can normalize duplicated and co-occurring features. During classification, we merged features like preposition frequencies with the topic vector and trained our Maxent classifier. We used the Maxent classifier provided by Mallet and ran our experiments with default hyperparameters and nine-tenths of the data as the training portion.

4 Conclusion and Future work

In this author profiling task, we applied an unsupervised learning method to extract the distribution of topics. We used 250 topics and 1000 iterations on the data set. We used a Maxent classifier to classify documents by gender and age group and observed that the performance of these models is independent of the language. To improve the performance of the system, one can use better stylometric features alongside the content-based features. Better markers like POS tags and superlative adjective occurrence can be used to improve performance on the gender-specific profiling task.

5 References

1. S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52(2): 119–123.
2. J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.
3. M. Koppel, S. Argamon and A. Shimoni (2002), Automatically categorizing written texts by author gender, Literary and Linguistic Computing 17(4): 401–412.
4. D. M. Blei, A. Y. Ng and M. I. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research 3(4–5): 993–1022.
5. A. K. McCallum (2002), MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu.