Statistical Learning Methods for Profiling Analysis
Notebook for PAN at CLEF 2015

Lesly Miculicich Werlen
Computer Science Department - University of Neuchâtel
Lesly.Miculicich@unine.ch

Abstract. Author profiling is the task of inferring information about an author by analyzing her or his writing style. Its applications in forensics, business intelligence and psychology make it an interesting research topic. In this notebook, we present our baseline approach using SVM and Linear Discriminant Analysis (LDA) classifiers. We analyze features obtained from the LIWC dictionaries: the frequencies of word use by category, which give a general view of how the author writes and what he or she is talking about. According to the experimental results, these are significant features for differentiating gender, age group and personality. Although they are relatively few (no more than 100), they allow us to discriminate with acceptable accuracy.

1 Introduction

Studies have shown evidence of differences in writing style according to the gender and age of the authors. These differences are detected through the use of function words and content words. Function words define how a person uses grammar and builds sentences, whereas content words indicate what the person is talking about. For example, Pennebaker [7] found that women tend to use more personal pronouns and words referring to emotions. In contrast, men tend to use more nouns, prepositions and big words (defined as words with more than 6 characters). In the case of age, Pennebaker [7] found that younger writers use more first-person personal pronouns and past-tense verbs, while older writers tend to use more articles, nouns, prepositions and future-tense verbs. Another example of these differences is given by Schler et al. [9], who found that men’s writing is more related to money, job and TV, while women’s writing is more related to family, sex and eating.
In the case of age, younger people tend to write more about sports, friends and emotions, while older people write more about money, job and family. More recent studies have expanded this analysis to determine whether personality influences writing style too. For example, Yarkoni [12] presented a detailed work in which he found that extroverted people are more likely than non-extroverted people to speak about leisure activities, family and other persons. People open to new experiences talk more about friends, time and positive emotions than other people, and similar relations were found for the other personality traits.

Based on this evidence, we believe that word categories are important features for determining the profile of the author of a text, and we try to measure how much they can tell us about the authors of tweets. Therefore, in this notebook, we present a baseline author profile classifier based on statistical learning over word category features. First, we present the feature pre-processing, extraction and selection. Then, we describe the classification models used to identify gender, age group and personality. Finally, we present the description of the experiments, the results and the conclusions.

2 Features

2.1 LIWC

The studies mentioned previously used the Linguistic Inquiry and Word Count (LIWC) [6]. This tool proposes a list of word categories; each category is formed by agreement between at least three judges. Then, given a text, it counts the number of words that the text has in each category. The idea is to know how frequently a person uses each word category and, with that data, to estimate some information about him or her. The LIWC categories are grouped into linguistic dimensions (e.g., part of speech), content dimensions (e.g., emotions and activities) and spoken dimensions (e.g., fillers and nonfluency markers). LIWC has dictionaries in a variety of languages; for this experiment the English, Italian, Spanish and Dutch dictionaries are used.
Each dictionary contains the list of predefined categories and the words associated with them; for example, some words associated with the category positive emotions are “fun”, “nice”, “success”, etc. The dictionaries were created through an iterative process of human judgment and were tested several times in different studies to ensure their validity.

2.2 Additional Categories

Additionally, we include two other groups of categories: punctuation marks and tweet. We include seven categories in punctuation marks: question mark, exclamation mark, period, comma, colon, semicolon and all punctuation. The last one groups any punctuation mark, including those mentioned before. The tweet categories are added because of the nature of the corpus and because they are frequently employed by Twitter users: emoticons, hyperlinks, hashtags and references to other users. Table 1 summarizes the categories analyzed in this experiment.

Table 1. Number of categories used as features per language dictionary

Dictionary  Linguistic  Content  Spoken  Punctuation  Tweet  Total
English     24          40       3       7            4      78
Dutch       14          55       2       7            4      82
Italian     14          74       0       7            4      99
Spanish     14          55       2       7            4      82

Table 2 shows one example of our analysis with LIWC and the additional categories for the tweet: "ummm yesterday, I was with @Tim in a beautiful concert :) .".

Table 2. Example of analysis with LIWC and additional categories for the tweet: "ummm yesterday, I was with @Tim in a beautiful concert :) ."

Dimension    Category                 Words                          Count
Linguistic   Word count               all words                      10
             Function words           “I”, “was”, “with”, “in”, “a”  5
             Pronoun                  “I”                            1
             Article                  “a”                            1
             ...                      ...                            ...
Content      Affection                “beautiful”                    1
             Positive emotion         “beautiful”                    1
             Time                     “yesterday”                    1
             ...                      ...                            ...
Spoken       Nonfluency               “ummm”                         1
Punctuation  All punctuation          , .                            2
             ...                      ...                            ...
Tweet        Emoticon                 :)                             1
             Reference to other user  “@Tim”                         1

2.3 Feature Extraction

The tweets were given in XML files. Each XML file was processed to extract only the text information given as scale.
First, the text was tokenized using white space (including tabs and line breaks, among others) and punctuation marks as separator characters. We consider the apostrophe to be part of the word; for example, “she’s” is considered a single word. This choice follows the LIWC dictionaries, which treat such words in this way. Once the tokens were obtained, they were counted in the corresponding categories. As mentioned before, each category in the dictionary has a set of words, so if a token is part of the word set of a category, one is added to the count of that category. One token can appear in many categories (e.g., “I” as pronoun and function word). The granularity of the models is set by user and not by tweet, so there is one vector of categories per user. Once the counting per user is complete, we divide each count by the total number of words of that user in order to obtain the relative frequencies. Finally, we keep the frequencies in matrix format, where the columns are the word categories and the rows are the users. Thus, each row represents the distribution over LIWC categories for the given user.

One additional step is performed before the feature selection: the relative word frequencies x are scaled by calculating the z-score with respect to each category according to the following formula:

\[ z = \frac{x - \mu}{\sigma} \tag{1} \]

where µ and σ are the mean and standard deviation of the frequencies in each category. The frequency of use of words is not uniform in a language. Some words are highly used (e.g., function words) and others have a low frequency of use (e.g., topic-related words). The relative frequencies are scaled because we need to compare them relative to the use of each particular category and not to the general use of language.

2.4 Feature Selection

Even though we have a reduced set of features, we need to discard noisy and irrelevant features before applying a classification scheme.
Additionally, this leads to an easier linguistic explanation of the key features that discriminate among the different classes.

Gender and Age Group. Four feature selection methods were evaluated to determine the most suitable one for the data: manual selection, Information Gain, Odds Ratio and Support Vector Machine Recursive Feature Elimination (SVM RFE). The manual selection was based on [7] and [5], which explain the most general categories for differentiating authors according to their age and gender. Information Gain and Odds Ratio were based on the study of Sebastiani [11], which compares different methods for feature reduction in text categorization. SVM RFE, proposed by Guyon et al. [3], is a backward feature elimination method using SVM; it eliminates one feature at a time according to a ranking criterion. The last three methods were implemented with Weka [4]. For their evaluation, three different classifiers were tested and the results were compared according to classification accuracy. The best subset was obtained with SVM RFE.

Personality. In the case of personality, the number of classes to discriminate is larger than in the previous models and it is more difficult to associate specific categories with each class. Consequently, the methods mentioned before did not yield a significant improvement over using the full set of features. For this case, we therefore applied Forward-Backward Feature Selection, trying to improve the Root Mean Squared Error (RMSE).

3 Classification

3.1 Gender and Age Group

These classes were defined as categorical. We have two classes for gender, “Male” and “Female”, and four classes for age: “18-24”, “25-34”, “35-49” and “50-xx”. The classification was made with ν-SVM [10], a variant of the original SVM in which the cost parameter is replaced by a parameter ν that is easier to interpret. In the experiments, ν was set to 0.01 and we used a radial kernel. The implementation was made in R with the library “e1071”.
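
The feature vectors passed to these classifiers are the z-scored category frequencies of Section 2.3. The following is a minimal sketch of that representation, written in Python rather than the R used in our experiments. The dictionary `TOY_DICTIONARY`, the token pattern and all function names are hypothetical and serve only for illustration; the real LIWC dictionaries are much larger, and the punctuation and tweet categories are left out for brevity. The use of the population standard deviation is also an assumption, since the text does not state which estimator was used.

```python
import re
from statistics import mean, pstdev

# Hypothetical toy dictionary: category name -> set of lower-cased words.
# It only imitates the structure of a LIWC dictionary.
TOY_DICTIONARY = {
    "function": {"i", "was", "with", "in", "a"},
    "pronoun": {"i"},
    "article": {"a"},
    "posemo": {"beautiful"},
    "time": {"yesterday"},
}

# A token is a run of letters; an apostrophe inside a word is kept,
# so "she's" stays a single token. White space and punctuation separate tokens.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(text):
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def relative_frequencies(tweets, dictionary):
    """Return {category: count / total words} over all tweets of one user."""
    tokens = [token for tweet in tweets for token in tokenize(tweet)]
    total = len(tokens)
    freqs = {}
    for category, words in dictionary.items():
        # A token may be counted in several categories.
        count = sum(1 for token in tokens if token in words)
        freqs[category] = count / total if total else 0.0
    return freqs


def fit_scaler(rows, categories):
    """Per-category mean and (population) standard deviation."""
    return {cat: (mean(r[cat] for r in rows), pstdev(r[cat] for r in rows))
            for cat in categories}


def z_scores(row, stats):
    """Apply Eq. (1); a zero deviation maps to 0.0 to avoid dividing by zero."""
    return {cat: ((row[cat] - mu) / sigma if sigma else 0.0)
            for cat, (mu, sigma) in stats.items()}


users = {
    "user_a": ["ummm yesterday, I was with @Tim in a beautiful concert :) ."],
    "user_b": ["She's in a meeting", "I was late"],
}
rows = {user: relative_frequencies(tweets, TOY_DICTIONARY)
        for user, tweets in users.items()}
stats = fit_scaler(list(rows.values()), list(TOY_DICTIONARY))
scaled = {user: z_scores(row, stats) for user, row in rows.items()}
```

Each entry of `scaled` corresponds to one row of the user-by-category matrix. The pairs stored in `stats` play the role of the vectors of means and standard deviations that are computed on the training data and reused to scale the testing data.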
3.2 Personality

In the case of personality, we define one model per personality trait. Our first approach was to define the classes as categorical without taking into account the order of the personality scores, so we have 11 classes from “-0.5” to “0.5” in steps of 0.1. We chose two classifiers: ν-SVM with ν set to 0.01 and a radial kernel, and Linear Discriminant Analysis (LDA). The implementation was done in R with the libraries “e1071” and “MASS”.

4 Experiments and Results

4.1 Training

The corpus used to develop the models is a training set of tweets in English, Spanish, Italian and Dutch provided by the PAN 2015 Author Profiling task [8]. The validation was made by measuring the accuracy for gender and age group, and the RMSE for personality (according to the specification of the task). We used the full training set with leave-one-out validation. The results are shown in Tables 3 and 4.

Table 3. Gender and Age Group: Accuracy in %. Leave-one-out validation with the training data set. Feature selection with SVM RFE and classification with SVM

           English  Dutch  Italian  Spanish
Gender     86       97     92       94
Age group  77       -      -        69

Table 4. Personality: RMSE rounded to two decimals. Leave-one-out validation with the training data set. Feature selection with Forward-Backward Feature Selection and classification with LDA and SVM

               English     Dutch       Italian     Spanish
Personality    LDA   SVM   LDA   SVM   LDA   SVM   LDA   SVM
Extroverted    0.16  0.16  0.14  0.13  0.21  0.18  0.16  0.19
Stable         0.22  0.21  0.12  0.22  0.18  0.18  0.18  0.20
Agreeable      0.16  0.15  0.23  0.15  0.21  0.16  0.15  0.16
Conscientious  0.15  0.15  0.18  0.12  0.18  0.12  0.14  0.19
Open           0.15  0.15  0.16  0.12  0.21  0.15  0.12  0.17

In the feature selection, the experiments show that some of the categories that discriminate between genders are independent of language while others differ for each language, and the same pattern holds for the other models. Tables 5, 6 and 7 contain the common categories that were found in one or more languages.
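
The validation protocol of this section can be sketched as follows. The sketch is in Python rather than R, and it uses a nearest-centroid classifier purely as a hypothetical stand-in for the ν-SVM and LDA models; only the leave-one-out loop and the two metrics mirror the text. The data at the end are made up for illustration and are not taken from the PAN corpus.

```python
import math


def fit_centroids(rows, labels):
    """Stand-in model (not the SVM or LDA of the paper): mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def leave_one_out(rows, labels):
    """Train on all users but one, predict the held-out user, and repeat."""
    predictions = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit_centroids(train_rows, train_labels)
        predictions.append(predict(model, rows[i]))
    return predictions


def accuracy(predicted, actual):
    """Fraction of users whose class was predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and true scores."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Made-up z-scored feature rows for four users.
rows = [[-1.0, -0.8], [-0.9, -1.1], [1.0, 0.9], [1.1, 1.2]]

# Gender is categorical and evaluated with accuracy.
gender = ["M", "M", "F", "F"]
gender_accuracy = accuracy(leave_one_out(rows, gender), gender)

# Personality scores are treated as categorical classes during training,
# but the predicted class values are compared numerically with RMSE.
scores = [0.1, 0.1, 0.3, 0.2]
score_rmse = rmse(leave_one_out(rows, scores), scores)
```

Treating each score as a class label and then measuring the numeric distance between the predicted and true labels is what allows a categorical classifier to be evaluated with RMSE, as done for Table 4.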
The training and testing are implemented separately. The outputs of the training step are the models and the vectors of means and standard deviations calculated with the training data. These vectors are used to calculate the z-score of the testing data. This step corresponds to Software 1 of TIRA [2].

Table 5. Selected Features for Gender: Common features among one or more languages

Linguistic   Prepositions, word count, you, pronouns
Content      Family, affect, space, swear, feel, emotions, body, home, work, TV, money, future, motion, school, inclusion (and, we, both), exclusion (or, either, but)
Spoken       None
Punctuation  Question mark, exclamation mark, colon
Tweet        Emoticon, reference to other users, hyperlinks

Table 6. Selected Features for Age Group: Common features among one or more languages

Linguistic   Prepositions
Content      Anger, body, optimist, insight, discrepancy, inhibition (block, constraint, deny)
Spoken       None
Punctuation  Comma
Tweet        Reference to other users, hyperlinks

Table 7. Selected Features for Personality: Common features among one or more languages

Extroverted    Word count, big words, pronouns, I, we, us, others, article, social, family, emoticons, reference to other users
Stable         Pronouns, oneself, we, us, others, article, affection, positive emotions, optimist, anxiety, sadness, anger, emoticons, reference to other users, hyperlinks
Agreeable      Pronouns, I, others, prepositions, inhibition, sadness, certain, see, listen, discrepancy, causation, cognitive process, emoticons, reference to other users, hyperlinks
Conscientious  Pronouns, I, us, others, time, present, past, work, motion, home, optimist, positive emotions, number, reference to other users, hyperlinks
Open           Pronouns, I, us, others, negation, preposition, number, affection, optimist, certain, discrepancy, cause, tentative, see, insight, emoticons, reference to other users, hyperlinks

4.2 Testing

The test corpus was provided by the PAN 2015 Author Profiling task [8].
The parameters are the input files and the models. This step corresponds to Software 2 on TIRA [2].

Table 8 shows the results of the testing. In almost all cases our solution performs better than the average, with less runtime than the majority. The best global results were in Dutch and English, and the worst in Italian. In the case of gender and age group, the results were good compared with the state of the art using similar features: Argamon et al. [1] reported 72% accuracy for distinguishing gender and 67% for age group (with 3 groups). This is especially so in the case of Spanish, where we obtained 92% accuracy in gender. Nevertheless, the accuracy of age-group classification was close to the average. In the case of personality, the results are also good taking into account the difficulty of the data (a larger number of classes to discriminate, many of them with very few or no training samples) and the small number of features used (fewer than 100). The runs selected for testing personality used the SVM classifier. The global ranking of our solution was 7th of 22 for English, 5th of 20 for Dutch, 9th of 19 for Italian and 8th of 21 for Spanish.

Table 8. Testing results: “GLOBAL” is the total performance of the solution, “Gender” and “Age” are measured by accuracy in %, and “BOTH” is the accuracy when gender and age were both correctly classified. The personality traits were measured with RMSE rounded to two decimal places; “RMSE” is the average over all personality traits.

English
Performance   GLOBAL  BOTH  Gender  Age  RMSE  Extrovert.  Stable  Agreeable  Conscient.  Open  Runtime
Best          79      73    86      84   0.14  0.13        0.20    0.13       0.11        0.12  02:38:33
Our solution  71      57    79      69   0.15  0.13        0.22    0.13       0.13        0.12  00:00:12
Mean          67      51    71      69   0.18  0.16        0.24    0.16       0.16        0.16  03:48:25
Worst         52      22    50      41   0.24  0.23        0.32    0.22       0.22        0.26  05:23:51

Dutch
Performance   GLOBAL  BOTH  Gender  Age  RMSE  Extrovert.  Stable  Agreeable  Conscient.  Open  Runtime
Best          94      -     97      -    0.06  0.08        0.06    0.00       0.10        0.04  00:00:01
Our solution  85      -     81      -    0.12  0.12        0.13    0.10       0.14        0.10  00:00:10
Mean          78      -     70      -    0.14  0.15        0.17    0.14       0.14        0.11  00:05:17
Worst         67      -     47      -    0.25  0.21        0.28    0.28       0.24        0.24  01:07:09

Italian
Performance   GLOBAL  BOTH  Gender  Age  RMSE  Extrovert.  Stable  Agreeable  Conscient.  Open  Runtime
Best          87      -     86      -    0.10  0.07        0.16    0.05       0.11        0.10  00:00:01
Our solution  74      -     64      -    0.15  0.11        0.17    0.12       0.17        0.19  00:00:12
Mean          74      -     64      -    0.16  0.12        0.21    0.14       0.15        0.18  00:02:46
Worst         60      -     42      -    0.21  0.19        0.26    0.22       0.25        0.25  00:17:18

Spanish
Performance   GLOBAL  BOTH  Gender  Age  RMSE  Extrovert.  Stable  Agreeable  Conscient.  Open  Runtime
Best          82      77    97      80   0.12  0.13        0.16    0.10       0.10        0.11  00:00:02
Our solution  73      63    92      68   0.16  0.19        0.20    0.13       0.14        0.17  00:00:13
Mean          67      52    79      62   0.18  0.18        0.22    0.16       0.17        0.16  00:10:36
Worst         50      22    56      36   0.27  0.30        0.29    0.26       0.27        0.27  01:00:24

5 Conclusions

The present approach using LIWC categories has proved to be a good solution considering the limitation of having few features compared with other solutions. According to the testing results, it performed better than the average of the state of the art. Moreover, it is simple and efficient. But the most important point is that we can linguistically justify the classification decision because we know which are the key features in the decision process. Deeper analysis is needed to explain correct and incorrect class assignments, and to compare the differences in the results using the SVM or LDA classifier. We think that the approach can be improved in the future with a finer analysis of features and selection methods, and with a more appropriate definition of the classes and modeling for age group and personality traits.

References

1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Communications of the ACM 52(2), 119–123 (2009)
2. Gollub, T., Stein, B., Burrows, S.: Ousting ivory tower research: towards a web framework for providing experiments as a service. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1125–1126. ACM (2012)
3. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3), 389–422 (2002)
4. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1), 10–18 (2009)
5. Newman, M.L., Groom, C.J., Handelman, L.D., Pennebaker, J.W.: Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes 45(3), 211–236 (2008)
6. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates 71 (2001)
7. Pennebaker, J.: The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury Publishing (2011)
8. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Labs and Workshops, Notebook Papers. CEUR-WS.org (2015)
9. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol. 6, pp. 199–205 (2006)
10. Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Computation 12(5), 1207–1245 (2000)
11. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34(1), 1–47 (2002)
12. Yarkoni, T.: Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. Journal of Research in Personality 44(3), 363–373 (2010)