An Author Profiling Approach Based on Language-dependent Content and Stylometric Features
Notebook for PAN at CLEF 2015

Alberto Bartoli, Andrea De Lorenzo, Alessandra Laderchi, Eric Medvet, and Fabiano Tarlao
DIA - University of Trieste, Italy
bartoli.alberto@univ.trieste.it, andrea.delorenzo@units.it, alessandra.laderchi@gmail.com, emedvet@units.it, fabiano.tarlao@phd.units.it

Abstract

We describe the approach that we submitted to the 2015 PAN competition [5] for the author profiling task (see footnote 1). The task consists of predicting some attributes of an author by analyzing a set of his/her Twitter tweets. We consider several sets of stylometric and content features, and different decision algorithms: we use a different combination of features and decision algorithm for each language-attribute pair, hence treating each pair as an individual problem.

1 Problem statement

A problem instance consists of a tuple $\langle D, L \rangle$, where $D$ is a set of tweets written by the same author and $L$ is a value of enumerated type that describes the language of the tweets: English, Spanish, Italian, or Dutch. Author profiling consists of generating, given a problem instance, the value of several attributes of the author of the tweets: gender, age group (only for English and Spanish), and 5 personality traits. Age group is an enumerated value among the following: 18–24, 25–34, 35–49, or $\geq 50$. The 5 personality traits are widely accepted characteristics used to describe human personality (also known as Big Five [7]): extroversion, neuroticism, agreeableness, conscientiousness, and openness to experience. For each trait, the attribute value consists of a score in $[-0.5, +0.5]$.

A set of solved problem instances (the training set) is available in which, for each problem instance $\langle D, L \rangle$, the tuple of the attribute values is provided. The effectiveness of a method for author profiling is assessed using a testing set of solved problem instances.
In particular, the effectiveness is assessed separately for each attribute as follows: the attribute values generated by the method for the problem instances in the testing set are compared against the actual values, and the comparison outcome is expressed in terms of accuracy for gender and age, and in terms of root-mean-square error (RMSE) for the personality traits.

Footnote 1: During the competition we discovered several opportunities for fraudulently boosting the accuracy of our method during the evaluation phase. We will describe these opportunities in a future report. We notified the organizers, who promptly acknowledged the high relevance of our concerns and took measures to mitigate the corresponding vulnerabilities. The organizers acknowledged our contribution publicly. We submitted for evaluation an honestly developed method (the one described in this document) that did not exploit such unethical procedures in any way.

2 Our approach

We chose to handle the prediction of each attribute for each language as an individual problem: in particular, we consider gender and age group prediction as 2 classification tasks and personality trait prediction as 5 regression tasks. Since we had tweets written in four languages and we had to predict age groups only for those written in English and Spanish, we considered 26 different problems (gender and 5 traits for each of the 4 languages, plus age group for 2 languages).

We propose a machine learning approach based on a number of different stylometric and content features which are processed by one among three different decision algorithm configurations: we used SVM and random forests as classifiers and regressors (see Section 2.4). We carried out an extensive experimental campaign for systematically assessing a large number of the possible combinations, through leave-one-out cross validation on the available training data.
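The problem enumeration, the two effectiveness measures of Section 1, and a generic leave-one-out loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: the names `problems`, `accuracy`, `rmse`, `leave_one_out`, and `train_mean` are hypothetical, and `train_mean` is a trivial stand-in learner that merely marks where an SVM or a random forest would be trained.

```python
import math

LANGUAGES = ["EN", "ES", "IT", "NL"]
TRAITS = ["Ext", "Neu", "Con", "Agr", "Ope"]


def problems():
    """Enumerate the (language, attribute) pairs, one per prediction problem."""
    pairs = []
    for lang in LANGUAGES:
        # Age group is predicted only for English and Spanish.
        attributes = ["Gen"] + (["Age"] if lang in ("EN", "ES") else []) + TRAITS
        pairs.extend((lang, attribute) for attribute in attributes)
    return pairs


def accuracy(predicted, actual):
    """Fraction of exact matches (used for gender and age group)."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root-mean-square error (used for the personality traits)."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def leave_one_out(X, y, train):
    """Hold out each instance once; train(X, y) must return a predict function."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = train(X_train, y_train)
        predictions.append(predict(X[i]))
    return predictions


def train_mean(X_train, y_train):
    """Hypothetical stand-in regressor: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


# Toy data (made up): one feature per instance, one trait score per instance.
X = [[0.0], [1.0], [2.0]]
y = [0.1, 0.3, -0.2]
predictions = leave_one_out(X, y, train_mean)
score = rmse(predictions, y)
```

Passing the learner as a `train` callable keeps the loop independent of the decision algorithm, so the same loop can score a classifier with `accuracy` or a regressor with `rmse`.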
2.1 Training set analysis and repetitions

During preliminary analysis, we noticed that the training set included some subsets of problem instances for which $L$ and the solution were the same, i.e., the attribute values for all the problem instances in a subset were the very same, despite $D$ being different. We call those problem instances repetitions. We conjectured that the tweets of the problem instances in each of those subsets were authored by the same person. For this reason, we decided to build a new training set by replacing each of those subsets with a single problem instance in which $D$ is the union of all the tweet sets of the subset, i.e., we merged the repetitions. Table 1 shows the sizes of the training set portions corresponding to each language before and after merging repetitions. We later experimentally verified that this transformation did affect the learned classifiers and regressors.

Language  Original  Merged
English   152       83
Spanish   100       50
Italian   38        19
Dutch     34        18

Table 1. Number of problem instances in the original training set and in the new training set built by merging repetitions.

2.2 Features

The feature extraction procedure requires a language-dependent dictionary in which words are grouped according to their prevalent topic (e.g., “money”, “sports”, or “religion”) or their function (e.g., “prepositions”, “articles”, or “negations”). To this end, we used an English dictionary similar to the one used by LIWC [4]. For the other 3 languages, we proceeded as follows. For Spanish and Dutch, we built the dictionary by automatically translating the English dictionary with Google Translate. For Italian, we manually built the dictionary, using the English one as a guideline. Moreover, for each language, we augmented the dictionary with a new category of words (“chat acronyms”) containing the top fifty most popular chat acronyms listed on NetLingo (see footnote 2).
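The merging of repetitions described in Section 2.1 amounts to grouping problem instances by language and attribute values and taking the union of their tweet sets. The sketch below is only an illustration under assumptions: the `Instance` class, its field names, and `merge_repetitions` are hypothetical, and the toy data is made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Instance:
    tweets: FrozenSet[str]  # D: the tweets of one problem instance
    language: str           # L: language of the tweets
    attributes: Tuple       # known attribute values (the solution)


def merge_repetitions(training_set: List[Instance]) -> List[Instance]:
    """Replace instances sharing language and solution with a single instance."""
    groups = {}
    for instance in training_set:
        key = (instance.language, instance.attributes)
        groups.setdefault(key, []).append(instance)
    merged = []
    for (language, attributes), members in groups.items():
        # D of the merged instance is the union of all tweet sets in the group.
        union = frozenset().union(*(m.tweets for m in members))
        merged.append(Instance(union, language, attributes))
    return merged


# Toy training set (made up): the first two instances are repetitions.
training_set = [
    Instance(frozenset({"a", "b"}), "IT", ("M", 0.1, 0.2, 0.0, -0.1, 0.3)),
    Instance(frozenset({"c"}), "IT", ("M", 0.1, 0.2, 0.0, -0.1, 0.3)),
    Instance(frozenset({"d"}), "IT", ("F", 0.2, 0.1, 0.0, 0.1, 0.0)),
]
merged = merge_repetitions(training_set)
```

Here the three toy instances collapse into two, the first of which holds the tweets of both repetitions.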
The feature extraction procedure is also based on the notion of automatic tweet, which we define as follows. We determined a set of ordered sequences of $n = 1, \dots, 4$ words, which we call templates, based on an analysis of the full training set:

1. we automatically extracted from the full training set all tweets starting with the same ordered sequence of $n$ words;
2. we automatically constructed a set including all word sequences that were the starting sequence of at least 3 different tweets;
3. we manually analyzed each sequence and retained only those which appeared to be the beginning of an automatically generated tweet.

We say that a tweet is an automatic tweet if its first words correspond to a template. Table 2 provides some examples of templates, along with the presence of corresponding automatic tweets of different languages in the training sets.

Template             Languages (among EN, ES, IT, NL) with automatic tweets
# Move más reciente  1 (ES)
Photo:               3
I’m at               3
I liked a            2
I favorited a        3
Ik vind een          2
#in                  2

Total number of templates per language: EN 29, ES 8, IT 12, NL 1.

Table 2. Some examples of templates and the number of languages for which at least one automatic tweet with that template was found. The first row corresponds to a template found only in Spanish problem instances, while the other rows are templates found in problem instances of multiple languages. The totals give, for each language, the count of templates for which at least one automatic tweet with that template was found.

The feature extraction procedure is as follows. Given a problem instance $\langle D, L \rangle$, we denote by $D_M$ the set of tweets obtained from $D$ by removing all the automatic tweets. We extract several numerical features from each problem instance: the value of each feature, except for 3, is obtained by averaging the corresponding computation outcomes over the tweets in $D$ or $D_M$; the remaining three feature values are computed on the whole $D$ and/or $D_M$.
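The automatic steps of the template construction, the removal of automatic tweets, and the per-tweet averaging described above can be sketched as follows. This is an illustration under assumptions rather than the original implementation: whitespace tokenization, the function names, and the toy tweets are all hypothetical, and the manual inspection of step 3 is represented by a hand-picked set.

```python
from collections import defaultdict


def candidate_templates(tweets, max_n=4, min_tweets=3):
    """Steps 1-2: word sequences (n = 1..max_n) that start >= min_tweets distinct tweets."""
    starters = defaultdict(set)
    for tweet in set(tweets):
        words = tweet.split()  # assumption: whitespace tokenization
        for n in range(1, min(max_n, len(words)) + 1):
            starters[tuple(words[:n])].add(tweet)
    return {seq for seq, group in starters.items() if len(group) >= min_tweets}


def is_automatic(tweet, templates):
    """A tweet is automatic if its first words correspond to a template."""
    words = tuple(tweet.split())
    return any(words[:len(t)] == t for t in templates)


def remove_automatic(tweets, templates):
    """Build D_M from D by dropping the automatic tweets."""
    return [t for t in tweets if not is_automatic(t, templates)]


def autotwe(tweets, templates):
    """Automatic tweets ratio |D \\ D_M| / |D|, computed on the whole D."""
    d_m = remove_automatic(tweets, templates)
    return (len(tweets) - len(d_m)) / len(tweets)


def mean_feature(tweets, per_tweet):
    """Average a per-tweet count over a set of tweets (0.0 if the set is empty)."""
    return sum(per_tweet(t) for t in tweets) / len(tweets) if tweets else 0.0


# Toy tweets (made up).
d = [
    "I'm at Trieste Airport",
    "I'm at Piazza Unita",
    "I'm at the office",
    "good morning everyone",
    "what a match!!",
]
candidates = candidate_templates(d)
templates = {("I'm", "at")}  # step 3: hand-picked, standing in for manual analysis
d_m = remove_automatic(d, templates)
exclmar = mean_feature(d_m, lambda t: t.count("!"))  # stylometric, hence on D_M
ratio = autotwe(d, templates)
```

In this toy example both `("I'm",)` and `("I'm", "at")` are candidates, which shows why the manual step is needed: only the second one plausibly marks an automatically generated tweet.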
For ease of presentation, we group conceptually similar features together; the full list is given in Table 3.

Stylometric. These features tend to capture the structural properties of a tweet in a way largely independent of both the language and the specific semantic content; therefore, they are not based on the dictionaries. Stylometric features are computed on tweets in $D_M$, because we assume that automatic tweets are not really representative of the tweet writing style of the author.

Content. These features are based on the dictionary categories related to word topic and are computed on tweets in $D$, because we assume that the content of automatic tweets is indeed informative of the author profile.

Hybrid. These features are based on the dictionary categories related to word function and are computed on tweets in $D_M$.

Footnote 2: http://www.netlingo.com/top50/popular-text-terms.php

2.3 Feature selection

Past studies on author profiling report several correlations between gender, age, personality traits, and writing style. In particular, [6] showed that stylometric features are more predictive than content features for determining gender, and vice versa for age group, but that the combination of both stylometric and content features can offer better results. In [3], the authors provided a list of correlations between some LIWC and non-LIWC features and the five personality traits. We constructed 40 different feature groups based on this knowledge and we assessed each of the resulting feature groups as described in Section 3.

2.4 Classifier and regressor

We decided to build a different model for each language-attribute pair, for a total of 26, as described in Section 2. We explored the usage of SVM [2] and Random Forest [1] with different configurations, as these methods can be used both as classifiers and as regressors.
In particular, we considered:

– svm: SVM with default Gaussian kernel and $C = 1$;
– rf500: Random Forest with 500 trees;
– rf2000: Random Forest with 2000 trees.

Group        Feature  Description
stylometric  allpunc  Number of .,:;
stylometric  commas   Number of ,
stylometric  exclmar  Number of !
stylometric  questma  Number of ?
stylometric  parenth  Number of parentheses
stylometric  numbers  Number of numbers
stylometric  wocount  Number of words
stylometric  longwor  Number of words longer than 6 letters
stylometric  upcawor  Number of uppercase words
stylometric  carrret  Number of carriage returns (\n, \r, \r\n)
stylometric  atmenti  Number of @ mentions
stylometric  extlink  Number of links
stylometric  hashtag  Number of #
stylometric  posemot  Number of positive emoticons
stylometric  negemot  Number of negative emoticons
stylometric  emotico  Number of emoticons
stylometric  emotiyn  Presence of emoticons in $D$ (binary feature)
content      moneywo  Number of words in the “money” category
content      jobword  Number of words in the “job or work” category
content      sportwo  Number of words in the “sports” category
content      televwo  Number of words in the “tv or movie” category
content      sleepwo  Number of words in the “sleeping” category
content      eatinwo  Number of words in the “eating” category
content      sexuawo  Number of words in the “sexuality” category
content      familwo  Number of words in the “family” category
content      frienwo  Number of words in the “friends” category
content      posemwo  Number of words in the “positive emotion” category
content      negemwo  Number of words in the “negative emotion” category
content      emotiwo  Number of words in the “positive emotion” or “negative emotion” category
content      swearwo  Number of words in the “swear words” category
content      affecwo  Number of words in the “affective process” category
content      feeliwo  Number of words in the “feeling” category
content      religwo  Number of words in the “religion” category
content      schoowo  Number of words in the “school” category
content      occupwo  Number of words in the “occupation” category
content      autotwe  Automatic tweets ratio, i.e., $|D \setminus D_M| / |D|$
content      autweyn  Presence of automatic tweets in $D$ (binary feature)
hybrid       fsipron  Number of words in the “I” category
hybrid       fplpron  Number of words in the “we” category
hybrid       ssipron  Number of words in the “you” category
hybrid       selfref  Number of words in the “self” category
hybrid       negpart  Number of words in the “negations” category
hybrid       asspart  Number of words in the “assents” category
hybrid       article  Number of words in the “articles” category
hybrid       preposi  Number of words in the “prepositions” category
hybrid       pronoun  Number of words in the “pronoun” category
hybrid       slangwo  Number of words in the “chat acronyms” category

Table 3. Features list.

3 Analysis

As described in the previous sections, we considered 40 sets of features and 3 classifiers/regressors. We systematically assessed the effectiveness of all the 120 resulting combinations by means of a leave-one-out procedure applied on the training set, separately for each language-attribute pair. That is, for each language-attribute pair, set of features, and classifier/regressor, (i) we built the subset $T$ of the problem instances of the training set with that language, (ii) we removed one element $t_0$ from $T$, (iii) we computed the values for the features set on the remaining problem instances in $T$ and trained the classifier/regressor, (iv) we applied the trained classifier/regressor to the problem instance $t_0$ and compared the generated answer against the known one. We repeated all but the first step $|T|$ times, i.e., by removing each time a different element, and computed the performance of the method in terms of the indexes defined in Section 1. Finally, we chose, for each language-attribute pair, the best performing combination, in terms of accuracy or RMSE, as appropriate for that attribute. The resulting configurations are summarized in Table 4.

In order to provide a synthetic baseline, we built 3 baseline methods using each of the 3 classifiers/regressors with all the features. The results, obtained by means of the same leave-one-out procedure, are shown in Table 5.

It can be seen from Table 4 that our procedure led us to choose a different configuration of classifier/regressor and features set for each language-attribute pair. There could be several reasons for that. First, every language has its own writing rules and culture, so it is possible that a middle-aged English man does not have the same interests and the same writing style as a middle-aged Italian man. Second, the Spanish, Dutch, and Italian dictionaries we used were not as good as the LIWC English one. Finally, the number of problem instances in the training set was not the same for every language, and neither was the number of tweets in the instances within each language subset.

L   Attribute  Class./Regr.  Chosen features set
EN  Gen        rf2000        commas negemot exclmar
EN  Age        rf2000        allpunc commas exclmar questma parenth numbers wocount longwor upcawor carrret atmenti extlink hashtag posemot negemot emotico autotwe
EN  Ext        svm           wocount questma parenth familwo
EN  Neu        svm           selfref fsipron chatacr affecwo emotiwo hashtag posemot pronoun wocount
EN  Con        rf500         extlink longwor numbers hashtag fsipron selfref
EN  Agr        svm           questma atmenti allpunc ssipron article longwor jobword chatacr
EN  Ope        rf2000        commas extlink hashtag exclmar questmar parenth wocount ssipron negpart article feeliwo moneywo jobword eatinwo familwo negemwo religwo
ES  Gen        svm           allpunc commas exclmar questma parenth numbers wocount longwor upcawor carrret atmenti extlink hashtag posemot negemot fsipron fplpron ssipron selfref negpart asspart article preposi pronoun slangwo moneywo jobword sportwo televwo sleepwo eatinwo sexuawo familwo frienwo posemwo negemwo affecwo feeliwo
ES  Age        svm           extlink hashtag numbers sleepwo sexuawo
ES  Ext        rf2000        longwor carrret questma preposi autweyn emotico
ES  Neu        rf2000        posemot ssipron exclmar selfref extlink
ES  Con        rf500         extlink longwor numbers hashtag fsipron selfref affecwo emotiwo
ES  Agr        svm           allpunc commas exclmar questma parenth numbers wocount longwor upcawor carrret atmenti extlink hashtag posemot negemot + fsipron fplpron ssipron selfref negpart asspart article preposi pronoun slangwo moneywo jobword sportwo televwo sleepwo eatinwo sexuawo familwo frienwo posemwo negemwo swearwo religwo
ES  Ope        rf2000        autotwe hashtag preposi wocount religwo
IT  Gen        rf500         asspart fsipron selfref exclmar extlink hashtag emotiyn
IT  Ext        svm           allpunc wocount hashtag questma
IT  Neu        rf2000        commas longwor fplpron chatacr autweyn
IT  Con        svm           commas extlink hashtag exclmar questmar parenth wocount ssipron negpart article feeliwo moneywo jobword eatinwo familwo negemwo religwo
IT  Agr        svm           posemot exclmar moneywo hashtag pronoun autweyn
IT  Ope        svm           negpart hashtag atmenti exclmar longwor
NL  Gen        rf2000        negemot upcawor preposi
NL  Ext        svm           questma atmenti allpunc ssipron article longwor jobword chatacr extlink autweyn
NL  Neu        rf2000        atmenti preposi longwor emotiyn
NL  Con        svm           hashtag questma exclmar atmenti posemot wocount extlink longwor
NL  Agr        svm           atmenti commas exclmar hashtag autweyn emotiyn
NL  Ope        svm           negpart hashtag atmenti exclmar longwor

Table 4. Chosen classifier/regressor and features set for each language-attribute pair.

L   Attribute  svm (baseline)  rf500 (baseline)  rf2000 (baseline)  Our conf.
EN  Gen        0.566           0.619             0.619              0.735
EN  Age        0.614           0.617             0.605              0.692
EN  Ext        0.185           0.182             0.181              0.165
EN  Neu        0.243           0.226             0.226              0.208
EN  Con        0.167           0.158             0.158              0.146
EN  Agr        0.173           0.183             0.183              0.162
EN  Ope        0.157           0.149             0.149              0.143
ES  Gen        0.760           0.760             0.760              0.820
ES  Age        0.400           0.404             0.416              0.580
ES  Ext        0.185           0.177             0.176              0.156
ES  Neu        0.243           0.220             0.220              0.202
ES  Con        0.161           0.163             0.162              0.154
ES  Agr        0.162           0.169             0.169              0.157
ES  Ope        0.183           0.183             0.183              0.168
IT  Gen        0.632           0.705             0.737              0.853
IT  Ext        0.159           0.162             0.162              0.121
IT  Neu        0.202           0.215             0.215              0.170
IT  Con        0.126           0.135             0.136              0.113
IT  Agr        0.159           0.165             0.165              0.150
IT  Ope        0.186           0.178             0.177              0.102
NL  Gen        0.611           0.344             0.333              0.633
NL  Ext        0.131           0.140             0.139              0.105
NL  Neu        0.206           0.205             0.204              0.156
NL  Con        0.122           0.125             0.125              0.101
NL  Agr        0.163           0.161             0.162              0.130
NL  Ope        0.121           0.122             0.122              0.104

Table 5. Results of our configuration and the synthetic baselines. Accuracy is reported for Gen and Age, RMSE is reported for Ext, Neu, Con, Agr, and Ope.

References

1. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
2. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)
3. Golbeck, J., Robles, C., Edmondson, M., Turner, K.: Predicting personality from Twitter. In: 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom). pp. 149–156. IEEE (2011)
4. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count (LIWC): A computerized text analysis program. Mahwah (NJ) 7 (2001)
5. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Gareth, J., San Juan, E. (eds.) CLEF 2015 Labs and Workshops, Notebook papers. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2015), http://www.clef-initiative.eu/publication/working-notes
6. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs 6, 199–205 (2006)
7. Soldz, S., Vaillant, G.E.: The big five personality traits and the life course: A 45-year longitudinal study. Journal of Research in Personality 33(2), 208–232 (1999)