Authorship Identification Using a Reduced Set of Linguistic Features
Notebook for PAN at CLEF 2012

Stefan Ruseti, Traian Rebedea
Department of Computer Science and Engineering
University Politehnica of Bucharest, Romania
stefan_ruseti@yahoo.com, traian.rebedea@cs.pub.ro

Abstract. The proposed solution for authorship attribution combines some of the most important features identified in previous research in this domain with classification algorithms in order to detect the correct author. We consider that the most relevant aspects of our work are the small number of linguistic features and the use of the same framework to solve both the open-class and the closed-class authorship problem, changing only the classification algorithm. This approach obtained an overall accuracy of 77% with regard to the total number of correctly classified documents.

1 Introduction

The problem of authorship identification or attribution for text documents has been widely studied in the last decades, especially in the last 20 years, but the solutions are not yet mature enough to consider the problem solved. Nowadays, the Web offers very large amounts of text to be used as corpora for authorship identification, but it also provides many different types of discourse that may be analyzed: from narratives and e-mails to online conversations and social network updates. It is obvious that each type of discourse should be treated independently; nevertheless, even the problem of identifying the author of long narrative texts is far from being closed.

Authorship attribution may be divided into two subtasks: determining the most descriptive features of the texts under consideration, then applying a classification algorithm in order to detect the most probable author [1]. The methods used in the classification stage range from principal component analysis and cluster analysis to support vector machines (SVMs) and neural networks.

The proposed solution started from an examination of the most important features and most powerful classification algorithms developed for PAN 2011 [2]. Of course, as the discourse type changed from e-mails to short narratives, the feature set also had to be changed, as described in the next section. An extensive set of features used in previous work on authorship attribution is also presented in [3]. The most successful team in PAN 2011 used a broad set of features, including several specific to e-mail conversations, and a maximum entropy classifier [4]. Another solution used a semi-supervised approach based only on character trigrams and an SVM in order to determine the documents in the test corpus that had the highest probability of being written by one of the authors; these documents were then added to the training set and the classifiers were retrained [5]. A final interesting approach that offered good results used several classifiers that could elect or directly veto the author of a document, with the outputs of all classifiers combined by voting [6].

The following section describes the small feature set chosen for solving the proposed problem, while section 3 briefly presents the choice of the classifier. We then describe the results that were obtained and how the answers for the open-class problem were derived, and wrap up with some conclusions.

2 The Reduced Set of Linguistic Features
In the multitude of approaches to author identification over the last years, a large list of features has been used, ranging from the character and lexical levels to the semantic level. Of course, some of these features need to be problem specific [3]. From all of them, we extracted a reduced set that proved to be suitable for the type of discourse in the PAN 2012 corpus. Thus, features related to the layout of the text or to spelling errors were irrelevant, because the corpus was composed of short narratives (or novels), which are usually written correctly and are most often edited as well. The remaining features used to describe the texts and to solve the classification problem are:

• Character trigrams – the 100 most common trigrams in the training corpus were selected, and the distribution of these trigrams was then computed for each text.

• POS bigrams and trigrams – the 50 most common bigrams and the 100 most common trigrams were selected. The POS tagging was performed with the RiTa POS Tagger (http://www.rednoise.org/rita/documentation/ripostagger_class_ripostagger.htm). However, like most other taggers, it returns very specific POS tags that do not offer the generality needed to capture an author's style. For example, "nnps" denotes a plural proper noun; only the first letter of each tag was kept, which proved sufficiently descriptive of the author's writing style.

• Suffixes – the 32 most common English suffixes were counted in each text. The percentage of suffixed words out of all words was recorded as well. To check whether a word carries a certain suffix, we first used a stemmer to verify that the word was actually formed with a suffix, and only then checked whether the word ends with a suffix from the list. This approach is not 100% correct, but it had a very small error rate that did not influence the classification.

• Word length – word lengths from 1 to 15 characters were counted; any word longer than 15 characters was counted in the 15 category. These features should capture the richness of the author's vocabulary.

• Syntactic complexity and structure – we used the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml) to create the parse tree of each sentence and to extract the syntactic dependencies. The average sentence length, the average and maximum tree depth, and the average and maximum distance between the elements of a dependency were recorded. Each dependency type was also counted in order to represent the author's predisposition for certain syntactic structures.

• Percentage of direct speech – some authors may tend to use more dialogue in their texts than others, so we also took this into consideration. Sometimes this feature can be irrelevant, because the type of the text can also determine the percentage of dialogue. In the evaluation stage, this feature increased the overall accuracy, so we decided to use it.

Each feature was normalized so that the lengths of the texts do not interfere with the results. Because there were only 2 training texts for each author, we split each one into smaller pieces. Cross-validation with only 2 examples would have been almost meaningless, because only one text would remain as a training example, so no generalization could be made. Pieces of 5 kB were used for sets A and C, and pieces of 50 kB for set I. This produced 100-200 training examples for each author, so the classifier could generalize better. The splitting took sentence boundaries into account, so that it would not interfere with the syntactic features; also, the last slice was not allowed to be smaller than half of the average slice size in the training set. A rough sketch of this slicing and feature-extraction step is given below.
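The following is a minimal sketch of the slicing and character-trigram steps only, not the original code: it assumes plain-text inputs, uses a crude regex sentence splitter instead of a proper tokenizer, approximates the "last slice" rule with a fixed half-slice-size threshold, and the function names are purely illustrative. The remaining feature groups (POS n-grams, suffixes, word lengths, syntactic and direct-speech statistics) would be computed per slice in a similar way and concatenated into one vector.

```python
# Minimal sketch (hypothetical helpers): sentence-aligned slicing and
# top-100 character-trigram features over plain-text training files.
import re
from collections import Counter

def split_into_slices(text, slice_size=5000):
    """Greedily pack whole sentences into slices of roughly `slice_size` characters."""
    sentences = re.split(r'(?<=[.!?])\s+', text)   # crude sentence splitter
    slices, current = [], ''
    for sentence in sentences:
        if current and len(current) + len(sentence) > slice_size:
            slices.append(current)
            current = sentence
        else:
            current = (current + ' ' + sentence).strip()
    if current:
        # Simplification of the paper's rule: merge a too-small trailing slice.
        if slices and len(current) < slice_size / 2:
            slices[-1] += ' ' + current
        else:
            slices.append(current)
    return slices

def top_trigrams(texts, k=100):
    """The k most frequent character trigrams over the whole training corpus."""
    counts = Counter()
    for text in texts:
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    return [trigram for trigram, _ in counts.most_common(k)]

def trigram_features(slice_text, vocabulary):
    """Relative frequency of each selected trigram, normalized by slice length."""
    counts = Counter(slice_text[i:i + 3] for i in range(len(slice_text) - 2))
    total = max(sum(counts.values()), 1)
    return [counts[t] / total for t in vocabulary]
```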
3 Classification Task

For classification we used the SVM implementation available in WEKA, the Sequential Minimal Optimization (SMO) algorithm. The test documents were also split into pieces of the same size as the training data, and the most common prediction over the pieces was used as the output of the classifier for each document.

For the open-class problem, we used the same classifier, but fitted a logistic regression model to its output, because we needed a more accurate probability estimate for each author in the training set. If, on average, the classifier produced a probability above 0.75 for some author (here the text was also split into pieces of different sizes), the classifier outputs that author; otherwise the answer is "other".

We also tried a Naive Bayes classifier, but its results in cross-validation on the training set were not as good as those of the SVM. However, the results were very close, so it can also be considered a viable classifier.
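To make the two decision rules above concrete, here is an illustrative sketch, not the authors' actual setup: scikit-learn's SVC with Platt scaling stands in for WEKA's SMO with fitted logistic models, the 0.75 threshold is taken from the description above, and the function names and data layout (one feature row per slice) are assumptions.

```python
# Illustrative stand-in for the WEKA-based setup described in the paper.
import numpy as np
from sklearn.svm import SVC

def train_classifier(X_train, y_train):
    """Linear SVM with probability estimates via Platt scaling (logistic fit)."""
    clf = SVC(kernel='linear', probability=True)
    clf.fit(X_train, y_train)
    return clf

def predict_document(clf, slice_vectors, open_class=False, threshold=0.75):
    """Classify one document from the feature vectors of its slices.

    Closed class: majority vote over per-slice predictions.
    Open class: average the per-slice class probabilities and answer
    'other' when no author reaches the threshold.
    """
    slice_vectors = np.asarray(slice_vectors)
    if not open_class:
        predictions = clf.predict(slice_vectors)
        values, counts = np.unique(predictions, return_counts=True)
        return values[np.argmax(counts)]
    mean_probs = clf.predict_proba(slice_vectors).mean(axis=0)
    best = int(np.argmax(mean_probs))
    return clf.classes_[best] if mean_probs[best] > threshold else 'other'
```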
4 Results

In order to determine the reduced feature set presented in section 2, different combinations of features were evaluated and we selected the ones that gave the best results in cross-validation. Different split sizes for the texts were tried as well. This experimental validation concentrated only on the closed-class attribution problem. In 10-fold cross-validation, the results were very good:

• 100% - set A (using 5 kB and 10 kB slices)
• 96.6% - set C (using 5 kB slices)
• 99.5% - set I (using 20 kB and 50 kB slices)

However, these results are not very relevant, because the training examples come from the same documents, so one expects many linguistic similarities between them. It was clear that the results on the test corpus would be significantly below these levels. Nevertheless, the described approach turned out to yield good results on the PAN 2012 test corpora as well, both for the closed and the open problems:

• A – 4/6 (66.66%)
• B – 8/10 (80%)
• C – 6/8 (75%)
• D – 12/17 (70.58%)
• I – 12/14 (85.71%)
• J – 13/16 (81.25%)

As expected, the results are not as good as in cross-validation, but the drop was not very steep. Thus, our solution obtained an overall document accuracy of 77%, ranking 3rd in the author attribution competition, and an average accuracy over all 6 test sets of 76%, ranking 7th, at a very close distance from the previous 4 places.

5 Conclusions

Using only a reduced set of linguistic features has proven to offer good results for the author identification task. These results might have been further improved by adding more application-specific features. Moreover, splitting the training texts proved to be a good solution for training, for evaluation and for scoring the test documents. The last conclusion is that using logistic regression on top of the solution designed for the closed-class problem provided competitive results for the open-class problem as well.

References

1. Juola, P.: Authorship Attribution. Found. Trends Inf. Retr. 1 (2006) 233-334
2. Argamon, S., Juola, P.: Overview of the International Authorship Identification Competition at PAN-2011. In: Petras, V., Forner, P., Clough, P.D. (eds.): CLEF 2011 (Notebook Papers/Labs/Workshop)
3. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci. Technol. 60 (2009) 538-556
4. Tanguy, L., Urieli, A., Calderone, B., Hathout, N., Sajous, F.: A Multitude of Linguistically-rich Features for Authorship Attribution. In: Petras, V., Forner, P., Clough, P.D. (eds.): CLEF 2011 (Notebook Papers/Labs/Workshop)
5. Kourtis, I., Stamatatos, E.: Author Identification Using Semi-supervised Learning. In: Petras, V., Forner, P., Clough, P.D. (eds.): CLEF 2011 (Notebook Papers/Labs/Workshop)
6. Kern, R., Seifert, C., Zechner, M., Granitzer, M.: Vote/Veto Meta-Classifier for Authorship Identification. In: Petras, V., Forner, P., Clough, P.D. (eds.): CLEF 2011 (Notebook Papers/Labs/Workshop)