Exploring Information Retrieval features for Author Profiling Notebook for PAN at CLEF 2014 Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira Institute of Informatics UFRGS - Porto Alegre - Brazil {erdweren,viviane,palazzo}@inf.ufrgs.br Abstract This paper describes the methods we have employed to solve the au- thor profiling task at PAN-2014. Our goal was to rely mainly on features from Information Retrieval to identify the age group and the gender of the author of a given text. We describe the features, the classification algorithms employed, and how the experiments were run. Also, we provide an analysis of our results compared to other groups. 1 Introduction Author profiling deals with the problem of finding as much information as possible about an author, just by analysing a text produced by that author. This is a challenging task which has applications in forensics, marketing, and security [1]. This paper reports on the participation of the INF-UFRGS team at the second edition of the author profiling task, organised in the scope of the PAN Workshop series, which is collocated with CLEF2014. More details about the task and the workshop can be found in [2,5] The task requires that participating teams come up with approaches that take a text as input and predict the gender (male/female) and the age group (18-24, 25-34, 35-49, 50-64, or 64+) of its author. 2 Features The texts from each author, or documents, were represented by a set of 64 features (or attributes), which were divided into five groups. Next, we explain each of these groups. Length These are simple features that calculate the absolute length of the text. – Number of Characters; – Number of Words; and – Number of Sentences. Information Retrieval This is the group of features that encode our assumption that authors from the same gender or age group tend to use similar terms and that the dis- tribution of these terms would be different across genders and age groups. The process here was the same as in [6]. The complete set of texts is indexed by an Information 1164 Retrieval (IR) System. Then, the text that we wish to classify is used as a query and the k most similar texts are retrieved. The ranking is given by the cosine or Okapi metrics as explained below. We employ a total of 30 IR-based features. – Cosine female_cosine_sum, male_cosine_sum, female_cosine_count, male_cosine_count, female_cosine_avg, male_cosine_avg, 18-24_cosine_sum, 25-34_cosine_sum, 35-49_cosine_sum, 50-64_cosine_sum, 65-xx_cosine_sum, 18-24_cosine_count, 25-34_cosine_count, 35-49_cosine_count, 50-64_cosine_count, 65-xx_cosine_count, 18-24_cosine_avg, 25-34_cosine_avg, 35-49_cosine_avg, 50-64_cosine_avg, 65-xx_cosine_avg. These features are computed as an aggregation function over the top-k results for each age/gender group obtained in response to a query composed by the key- words in the text that we wish to classify. We tested three types of aggregation functions, namely: count, sum, and average. For this featureset, queries and doc- uments were compared using the cosine similarity (Eq. 1). For example, if we re- trieve 100 documents in response to a query composed by the keywords in q, and 50 of the retrieved documents were in the 18-24’s age group, then the value for 18-24_cosine_avg is the the average of the 50 cosine scores for this class. Similarly, 18-24_cosine_sum is the summation of such scores, and 18-24_cosine_count simply counts how many retrieved documents fall into the 18-24_cosine_count category. → −c · → −q cosine(c, q) = → − → − (1) | c || q | where →−c and →− q are the vectors for the document and the query, respectively. The vectors are composed of tfi,c × idfi weights where tfi,c is the frequency of term i N in document c, and IDFi = log n(i) where N is the total number of documents in the collection, and n(i) is the number of documents containing i. – Okapi BM25 female_okapi_sum, male_okapi_sum, female_okapi_count, male_okapi_count, female_okapi_avg, male_okapi_avg, 18-24_okapi_sum, 25-34_okapi_sum, 35-49_okapi_sum, 50-64_okapi_sum, 65-xx_okapi_sum, 18-24_okapi_count, 25-34_okapi_count, 35-49_okapi_count, 50-64_okapi_count, 65-xx_okapi_count, 18-24_okapi_avg, 25-34_okapi_avg, 35-49_okapi_avg, 50-64_okapi_avg, 65-xx_okapi_avg . Similar to the previous featureset, these features compute an aggregation function (average, sum, and count) over the the retrieved results from each gender/age group that appeared in the top-k ranks for the query composed by the keywords in the document. For this featureset, queries and documents were compared using the Okapi BM25 score (Eq. 2). n X tfi,c · (k1 + 1) BM 25(c, q) = IDFi |D| (2) i=1 tfi,c + k1 (1 − b + b avgdl ) where tfi,c and IDFi are as in Eq. 1 |d| is the length (in words) of document c, avgdl is the average document length in the collection, k1 and b are parameters that tune the importance 1165 of the presence of each term in the query and the length of the text. In our experiments, we used k1 = 1.2 and b = 0.75. Readability Readability tests indicate the comprehension difficulty of a text. – Flesch-Kincaid readability tests We employ two tests that indicate the comprehension difficulty of a text: Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) [4]. They are given by Eqs. 3 and 4. Higher FRE scores indicate a material that is easier to read. For example, a text with a FRE scores between 90 and 100 could be easily read by a 11 year old, while texts with scores below 30 would be best understood by un- dergraduates. FKGL scores indicate a grade level. A FKGL of 7, indicates that the text is understandable by a 7th grade student. Thus, the higher the FKGL score, the higher the number of years in education required to understand the text. The idea of using these scores is to help distinguish the age of the author. Younger authors are expected to use shorter words and thus have a smaller FKGL and a high FRE.     #words #syllables F RE = 206.835 − 1.015 − 84.6 (3) #sentences #words     #words #syllables F KGL = 0.39 + 11.8 − 15.59 (4) #sentences #words Correctness This group of features aims at capturing the correctness of the text. – Words in the dictionary: ratio between the words from the text found in the OpenOffice US dictionary1 and the total number of words in the text. – Cleanliness: ratio between the number of characters in the preprocessed text and the number of characters in the raw text. The idea is to assess how "clean" the original text is. – Repeated Vowels: in some cases, authors use words with repeated vowels for emphasis. e.g. "I am soo tired". This group of features counts the numbers of re- peated vowels (a, e, i, o, and u) in sequence within a word. – Repeated Punctuation: this features compute the number of repeated punc- tuation marks (i.e., commas, semi-colons, full stops, question marks, and exclamation marks) in sequence in the text. Style – HTML tags: this feature consists in counting the number of HTML tags that indi- cate line breaks
, images , and links . – Diversity: this feature is calculated as the ratio between the distinct words in the text and the total number of words in the text. 1 http://extensions.openoffice.org/en/project/ english-dictionaries-apache-openoffice 1166 Table 1. Top 5 features in terms of Information Gain Age Gender Corpus Lang Top 5 features IG Type Top 5 features IG Type 18-24_okapi_sum 0.083 IR male_okapi_avg 0.160 IR 50-64_cosine_sum 0.083 IR 25-34_okapi_avg 0.154 IR Twitter EN 25-34_okapi_sum 0.081 IR male_okapi_sum 0.153 IR 25-34_cosine_sum 0.077 IR 35-49_okapi_avg 0.152 IR 18-24_cosine_sum 0.075 IR female_okapi_avg 0.140 IR 0.140 Style number of words 0.183 Length 25-34_okapi_count 0.136 IR words in the dictionary 0.157 Correctness Twitter ES 25-34_cosine_sum 0.129 IR male_okapi_sum 0.155 IR 25-34_cosine_count 0.123 IR diversity 0.149 Style 50-64_cosine_sum 0.114 IR male_cosine_sum 0.148 IR diversity 0.000 Style female_cosine_sum 0.156 IR male_okapi_sum 0.000 IR male_okapi_count 0.146 IR Blog EN male_okapi_count 0.000 IR female_okapi_count 0.137 IR female_okapi_count 0.000 IR female_cosine_count 0.118 IR female_okapi_sum 0.000 IR cleanliness 0.114 Correctness 25-34_cosine_sum 0.260 IR number of words 0.251 Length words in the dictionary 0.231 Correctness words in the dictionary 0.226 Correctness Blog ES 50-64_okapi_avg 0.224 IR repeated_e 0.206 Correctness 50-64_okapi_sum 0.224 IR 50-64_okapi_avg 0.200 IR 25-34_cosine_count 0.223 IR male_okapi_sum 0.194 IR 50-64_cosine_sum 0.122 IR female_cosine_count 0.008 IR 50-64_cosine_count 0.122 IR female_cosine_sum 0.007 IR SocialMedia EN 35-49_cosine_count 0.117 IR female_okapi_count 0.007 IR 18-24_cosine_count 0.116 IR male_okapi_count 0.007 IR 35-49_cosine_sum 0.114 IR male_cosine_count 0.006 IR 18-24_okapi_count 0.200 IR female_cosine_count 0.081 IR 50-64_okapi_count 0.200 IR female_cosine_sum 0.079 IR SocialMedia ES 18-24_cosine_count 0.193 IR male_cosine_count 0.071 IR 35-49_cosine_count 0.191 IR 25-34_cosine_avg 0.053 IR 18-24_cosine_sum 0.189 IR female_okapi_count 0.052 IR 65-XX_cosine_sum 0.098 IR female_okapi_count 0.106 IR 25-34_okapi_count 0.098 IR male_okapi_count 0.106 IR Reviews EN 25-34_cosine_count 0.087 IR female_cosine_count 0.079 IR 65-XX_cosine_count 0.083 IR male_cosine_count 0.079 IR 65-XX_okapi_count 0.082 IR female_cosine_sum 0.072 IR 3 Usefulness of the Features In order to evaluate how discriminant each of the 64 features described in Section 2 is, we calculated their information gain with respect to the class. The five highest ranking features for each corpus and each class are shown in Table 1. The vast majority of the most discriminative features is from the IR group. Style, length, and correctness also appear, but at a much lower frequency. For Age-Blogs-EN, none of our features had a good score for information gain. Interestingly, we got the best scores for this corpus on the test data, compared to other groups. Information gain evaluates each feature independently from each other. However, when selecting the best group of features, we wish to avoid redundant features by keep- ing features that have at the same time a high correlation with the class and a low intercorrelation. With this aim, we used Weka’s [3] subset evaluators to select good sub- sets of features. These subsets are shown in Table 2. The number of attributes in these 1167 Table 2. Best subset of features for each corpus Corpus Lang Age Gender 18-24_cosine_sum 18-24_cosine_count male_okapi_count male_okapi_sum Twitter EN 35-49_okapi_count repeated_e repeated_exclamation 50-64_cosine_sum 65-XX_cosine_count 25-34_okapi_sum male_cosine_sum 25-34_okapi_count male_cosine_count Twitter ES words_in_dictionary words_in_dictionary repeated_exclamation number_of_characters repeated_e repeated_semicolon male_cosine_avg 50-64_okapi_count female_cosine_sum Blog EN male_cosine_count repeated_exclamation female_okapi_count repeated_interrogation 65-XX_cosine_count repeated_e Blog ES 65-XX_cosine_avg repeated_exclamation 25-34_okapi_sum female_cosine_avg male_cosine_count male_cosine_avg 18-24_cosine_sum 25-34_cosine_avg 35-49_cosine_count 35-49_cosine_avg female_okapi_count SocialMedia EN 18-24_okapi_count FRE 65-XX_okapi_avg FKGL repeated_exclamation repeated_i repeated_interrogation repeated_fullstop 50-64_cosine_sum 18-24_cosine_count female_cosine_sum female_okapi_sum male_cosine_avg male_okapi_count male_okapi_count 18-24_okapi_sum 18-24_okapi_count SocialMedia ES 18-24_okapi_count FKGL 18-24_okapi_avg repeated_a repeated_i number_of_characters repeated_u repeated_a repeated_exclamation repeated_ponto female_cosine_avg 18-24_cosine_sum 65-XX_cosine_sum 65-XX_cosine_count 65-XX_okapi_sum 25-34_okapi_count female_cosine_sum 65-XX_okapi_count 50-64_okapi_count FKGL Reviews EN 65-XX_okapi_count number_of_characters diversity repeated_i repeated_semicolon repeated_o repeated_comma repeated_semicolon repeated_exclamation cleanliness diversity 1168 subsets varied a lot, from one (Gender-Twitter-EN) to 16 (Age-Reviews-EN). Again, we observed that most features in the subsets are IR-based. Surprisingly, readability features (namely FKGL) appear in only two subsets for Age. Style and correctness at- tributes also appear in the chosen subsets. Also, we noticed that some features that were intended for age, have been selected as useful for gender and vice-versa. 4 Official Experiments We treated gender and age classification separately. Thus, the features described in the previous section were used to train one classifier for each corpus for gender and age resulting in 14 classifiers. We used Weka [3] to build the machine learning models. A number of algorithms was tested, namely: BayesNet, Logistic, MultilayerPerceptron, SimpleLogistic, LogitBoost, RotationForest, and MetaMultiClass. We chose the algo- rithm which got the best result for the training data using 10-fold cross-validation. To make such choice, we analysed the results of the classifiers in two scenarios: using all 64 attributes and using just the attributes in the best subset. The preprocessing consisted basically in tokenisation, removal of tags, and escape characters. No stemming or stopword removal was performed. All training instances were used to generate the model. No attempt to remove noise was taken. Table 3 shows our official results for both training and test corpora in terms of accu- racy. It also shows which classification algorithm was used and whether all attributes or just a subset were used. Most classifiers (11 out of 14) used just the subset of attributes, as their results on the training data outperformed (or got very close to) the results using all attributes. As expected, results on the training corpora were superior to the results on the test corpora. The biggest drop was for Age-Blog-ES as in this corpus, in which accuracy dropped by half. Interestingly, the results for three corpora were better on the test data (Age-Twitter-ES, Age-Blogs-EN, and Gender-Twitter-ES). We still need to investigate these differences further. Table 3. Official Results Age Corpus Lang Training Test Classifier Attributes Twitter EN 0.5261 0.3312 LogitBoost Subset Twitter ES 0.5056 0.5222 RotationForest Subset Blog EN 0.4558 0.4615 MultiClassClassifier Subset Blog ES 0.5455 0.2500 LogitBoost Subset SocialMedia EN 0.4251 0.3489 Logistic All SocialMedia ES 0.4866 0.4382 Logistic Subset Reviews EN 0.3762 0.3343 Logistic Subset Gender Corpus Lang Training Test Classifier Attributes Twitter EN 0.7876 0.5714 Logistic Subset Twitter ES 0.4494 0.5333 Logistic All Blog EN 0.8299 0.6410 MultilayerPerceptron Subset Blog ES 0.7955 0.5357 RotationForest Subset SocialMedia EN 0.5704 0.5361 SimpleLogistic Subset SocialMedia ES 0.7020 0.6307 SimpleLogistic All Reviews EN 0.7103 0.6778 SimpleLogistic Subset 1169 0.15 0.1 0.05 0 age gender age gender age gender age gender age gender age gender age gender -0.05 Twitter SocialMedia Blogs Reviews Twitter SocialMedia Blogs -0.1 English Spanish -0.15 -0.2 Figure 1. Comparison against the mean results of all participants We also analysed our results compared against the mean of all participants. These are shown in Figure 1. For 9 out of 14 cases, our results were above the mean. The case with the biggest gain was Age-Blogs-EN, in which the advantage was of 31%. In 5 runs, our results were at or below the mean. Our worst scores were for Age-Blogs-ES, in which our loss was of nearly 66%. Adding up all gains and losses, we get a positive result of 10% in relation to the average. 5 Conclusion This paper describes our participation in the Author Profiling task run in PAN 2014. We used the training data to build classifiers using several machine learning algorithms. Our focus was on exploring Information Retrieval-based features. The official results show that our scores were above the mean for all participants in most cases (9 times out of 14). Author profiling is a challenging task. Consequently, there are many possibilities for future work. As a first step, once the test data is released, we will further investigate the cases in which our system fails or succeeds in the classification. The goal is to try and establish patterns. We are also interested in testing methods for instance selection to improve our classification models. In addition, we have treated gender and age clas- sification separately as independent problems. However, since some attributes meant to discriminate gender were found useful for age (and vice-versa), we wish to explore the influence of both types of classification into each other. Acknowledgements: This work has been partially supported by CNPq-Brazil (478979/2012-6). We thank Anderson Kauer for his help in revising this paper. We thank Martin Potthast, Francisco Rangel, and other members of the PAN organising team for their help in getting our software to run. 1170 References 1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Commun. ACM 52(2), 119–123 (Feb 2009) 2. Gollub, T., Potthast, M., Beyer, A., Busse, M., Pardo, F.M.R., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation - plagiarism detection, author identification, and author profiling. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF. Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer (2013) 3. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (Nov 2009) 4. Kincaid, J.P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Tech. rep., National Technical Information Service, Springfield, Virginia (Feb 1975) 5. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at pan 2013. In: Notebook Papers of CLEF 2013 LABs and Workshops, CLEF-2013, Valencia, Spain, September. pp. 23–26 (2013) 6. Weren, E.R.D., Kauer, A.U., Mizusaki, L., Moreira, V.P., Oliveira, J.P.M.D., Wives, L.: Examining multiple features for author profiling. Journal of Information and Data Management (JIDM) 5(1) (October 2014), to appear. 1171