A Machine Learning-based Intrinsic Method for Cross-topic and Cross-genre Authorship Verification
Notebook for PAN at CLEF 2015

Yunita Sari and Mark Stevenson
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom
E-mail: {y.sari, mark.stevenson}@sheffield.ac.uk

Abstract. This paper presents our approach to the Author Identification task in the PAN CLEF Challenge 2015. We identified the main challenges of this year's task as the limited amount of training data and the fact that the sub-corpora are independent in terms of topic and genre. We adopted a machine learning-based intrinsic method to verify whether a pair of documents was written by the same author or by different authors. Several content-independent features, such as function words and stylometric features, were used to capture the differences between documents. Evaluation results on the test corpora show that our approach works best on the Spanish data set, with AUC and C@1 scores of 0.7238 and 0.67 respectively.

Keywords: Machine Learning, Intrinsic Method, Authorship Verification

1 Cross-topic and Cross-genre Authorship Verification

Given a pair of documents (X, Y), the task of authorship verification is to identify whether the documents have been written by the same author or by different authors. Compared to authorship attribution, authorship verification is significantly more difficult. In verification, a system does not learn the specific characteristics of each author, but rather the differences between a pair of documents. The problem is complicated by the fact that an author may consciously or unconsciously vary his or her writing style from text to text [5].

This year's PAN lab Author Identification task focuses on cross-genre and cross-topic authorship verification, where the genre and/or topic may differ significantly between the known and unknown documents. This setting is more representative of real-world applications, where we cannot control the genre or topic of the documents. The PAN Author Identification task is defined as follows:

"Given a small set (no more than 5, possibly as few as one) of known documents by a single person and a questioned document, the task is to determine whether the questioned document was written by the same person who wrote the known document set. The genre and/or topic may differ significantly between the known and unknown documents" [1]

1.1 Data set

The data set consists of authorship verification problems in four different languages. Each problem contains a number of known documents written by a single person and exactly one unknown document. The genre and/or topic may differ significantly between the documents, and document length varies from a few hundred to a few thousand words. Table 1 lists the sub-corpora together with their language and type (cross-genre or cross-topic).

Table 1: The authorship verification problems in the training data set

Language   Type          Total problems
Dutch      cross-genre   100
English    cross-topic   100
Greek      cross-topic   100
Spanish    cross-genre   100

1.2 Performance measure

Authorship verification systems are tested on a set of problems and must provide a probability score for each unknown document. Performance is evaluated using the area under the ROC curve (AUC). In addition, the output is measured with the c@1 score [7]. A probability score greater than 0.5 is considered a positive answer, while a score lower than 0.5 is considered a negative answer. If the score is exactly 0.5, the problem is treated as unanswered (an "I don't know" answer). The c@1 measure is defined as follows:

c@1 = \frac{1}{n}\left(n_c + n_u \cdot \frac{n_c}{n}\right)    (1)

where:
- n   = number of problems
- n_c = number of correct answers
- n_u = number of unanswered problems

The overall performance is evaluated as the product of AUC and c@1.
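To make the scoring concrete, the following is a minimal Python sketch (not the official PAN evaluator) that computes c@1 from a list of probability scores and gold same-author labels, and combines it with an AUC value into the final score. The function and variable names are our own; only the 0.5 threshold and the formulas follow the description above.

```python
def c_at_1(scores, gold):
    """Compute c@1 given probability scores and gold labels (True = same author).

    Scores > 0.5 count as positive answers, < 0.5 as negative,
    and exactly 0.5 as unanswered, as described in Section 1.2.
    """
    n = len(scores)
    n_correct = sum(1 for s, g in zip(scores, gold)
                    if s != 0.5 and (s > 0.5) == g)
    n_unanswered = sum(1 for s in scores if s == 0.5)
    return (n_correct + n_unanswered * n_correct / n) / n


def final_score(auc, c1):
    """Overall performance: the product of AUC and c@1."""
    return auc * c1


# Example: three answered problems (two correct) and one left unanswered.
scores = [0.9, 0.3, 0.5, 0.2]
gold = [True, False, True, True]
print(final_score(0.72, c_at_1(scores, gold)))
```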
2 Methodological Approach

We adopted a machine learning-based intrinsic method to address this verification problem. Intrinsic methods use only the provided documents (in this case the known and unknown documents) to determine whether they were written by the same author. A machine learning algorithm is then trained on labelled document pairs to construct a model which can be used to classify unlabelled pairs. Note that in verification problems the learner does not model the specific characteristics of each author, but rather the differences between a pair of documents [6]. Texts are represented by various types of features such as function words, character n-grams, word n-grams and several stylometric features.

2.1 Textual Representation

As the genre and/or topic may differ significantly between the known and unknown documents, we cannot rely on content-based features to capture the differences between documents. We therefore focused on content-independent features such as function words and stylometric features. In addition, these features can be applied to any of the languages used in the task. We used six types of features in total: stylometric features (10), function words, character 8-grams, character 3-grams, word bigrams, and word unigrams.

Given a collection of problems P = {P_i : ∀i ∈ I}, where I = {1, 2, 3, ..., n} is the index set of P, each P_i contains exactly one unknown document U and a set of known documents K = {K_j : ∀j ∈ J}, where J is the index set of K and 1 ≤ |J| ≤ 5. Our approach represents each problem P_i as a vector P_i = {R_1, R_2, ..., R_m}, where m is the number of feature types (six in our case). Each R_i is the distance between two feature vector representations of the same type, one built from the set of known documents K and one from the unknown document U. If K contains more than one document, the generated feature vector is the average vector over the |J| known documents. Table 2 gives details of the feature vector representations and the comparison measures used.

Table 2: List of features and comparison measures

Feature                     Model                                              Comparison method
(R1) Stylometric features   average feature presence                           min-max similarity
(R2) Function words         ratio of function words to the total number        Manhattan distance
                            of words in the document
(R3) Character 8-grams      tf-idf                                             cosine similarity
(R4) Character 3-grams      tf-idf                                             cosine similarity
(R5) Word bigrams           tf-idf                                             cosine similarity
(R6) Word unigrams          tf-idf                                             cosine similarity

Stylometric Features. Ten stylometric features were used in our experiments. Some were adapted from Guthrie's work [3] on anomalous text detection, where they were among the most effective features for separating anomalous segments from normal segments of text. The complete list of stylometric features is:

1. Average number of non-standard words¹
2. Average number of words per sentence
3. Percentage of short sentences (fewer than 8 words)
4. Percentage of long sentences (more than 15 words)
5. Percentage of words with three syllables
6. Lexical diversity (ratio of the number of unique words to the total number of words in a document)
7. Total number of punctuation marks

¹ The Enchant spell checking library (http://www.abisource.com/projects/enchant/) was used to identify non-standard English words.

We also implemented three readability measures (a sketch of their computation is given after this list):

1. Flesch-Kincaid Reading Ease [4]

ReadingEase = 206.835 - 1.015\,\frac{total\_words}{total\_sentences} - 84.6\,\frac{total\_syllables}{total\_words}    (2)

2. Flesch-Kincaid Grade Level [4]

GradeLevel = 0.39\,\frac{total\_words}{total\_sentences} + 11.8\,\frac{total\_syllables}{total\_words} - 15.59    (3)

3. Gunning-Fog Index [2]

FogIndex = 0.4\left(\frac{total\_words}{total\_sentences} + 100\,\frac{words\_with\_3\_or\_more\_syllables}{total\_words}\right)    (4)
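As an illustration of Equations 2-4, the following is a minimal sketch of computing the three readability measures from raw counts. The regex tokenisation and the vowel-group syllable counter are simplifying assumptions for the example, not the exact preprocessing used in our system.

```python
import re


def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels (assumption, not exact).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return (Reading Ease, Grade Level, Fog Index) for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    hard_words = sum(1 for w in words if count_syllables(w) >= 3)

    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word

    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw         # Eq. (2)
    grade_level = 0.39 * wps + 11.8 * spw - 15.59             # Eq. (3)
    fog_index = 0.4 * (wps + 100 * hard_words / len(words))   # Eq. (4)
    return reading_ease, grade_level, fog_index


print(readability("This is a short example. It has two simple sentences."))
```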
2.2 Distance Measures

We experimented with several comparison measures for computing the similarity between a pair of vectors. We observed that particular comparison metrics performed better on certain types of features, so we applied a different measure to each feature type. Three distance measures were used:

Cosine similarity measure

d(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2}\,\sqrt{\sum_{i=1}^{p} y_i^2}}    (5)

Minimum-maximum similarity measure

minmax(x, y) = \frac{\sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} \max(x_i, y_i)}    (6)

City block distance (also called Manhattan distance or L1 distance)

d(x, y) = \sum_{i=1}^{p} |x_i - y_i|    (7)

2.3 Feature selection and classifier

Our authorship identification software was written in Python. We applied feature selection using the ExtraTreeClassifier and used an SVM as the classifier. The classifier hyperparameters were optimized using GridSearchCV. The scikit-learn library² was used for both feature selection and classification.

² http://scikit-learn.org
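A minimal sketch of this setup with scikit-learn is shown below. It assumes the ensemble ExtraTreesClassifier variant wrapped in SelectFromModel for importance-based feature selection, an SVC with probability outputs, and a small illustrative parameter grid; the actual grid, selection threshold and data handling in our system are not reproduced here. X holds one six-dimensional vector (R1-R6) per problem and y the same-author labels.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# X: one row per problem, columns R1..R6 (distances between known and unknown
# representations); y: 1 = same author, 0 = different author. Random data here.
rng = np.random.RandomState(0)
X = rng.rand(100, 6)
y = rng.randint(0, 2, size=100)

pipeline = Pipeline([
    # Tree-based feature selection: drop features with low importance.
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))),
    # SVM with probability estimates, so the system can output a score per problem.
    ("svm", SVC(probability=True)),
])

# Illustrative hyperparameter grid (assumption, not the grid used in the paper).
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="roc_auc")
search.fit(X, y)

# Probability of the "same author" class for each problem.
scores = search.predict_proba(X)[:, 1]
print(search.best_params_, scores[:5])
```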
3 Evaluation and Result

3.1 Training corpora

We evaluated the approach on the training data using 10-fold cross-validation. Owing to some incompatibility issues, we did not perform the verification task on the Greek data. Table 3 shows the results of our approach on three of the sub-language corpora. The best result was achieved on the Spanish data set, with AUC and C@1 scores of 0.846 and 0.807 respectively. Compared to the other sub-language corpora, the Spanish data set contains more known documents, which may explain why the results on this data set are better than those on the other sub-languages. We also observed that some NLP libraries, which are mainly trained on English data, did not perform well on the non-English data sets. Feature selection was therefore applied to remove unhelpful features.

Table 3: 10-fold cross-validation on the training corpora

Data set   AUC     C@1     finalScore
English    0.662   0.606   0.401
Dutch      0.618   0.553   0.342
Spanish    0.846   0.807   0.683

3.2 Testing corpora

Table 4 shows the official results of our approach on the test data released by the PAN 15 organizers. As predicted, our approach performed well on the data sets with more known documents. The best results were achieved on the Spanish data, with a final score of 0.48495. Our approach uses supervised learning, whose performance depends strongly on the amount of training data; as can be seen in Table 4, the verification task therefore did not obtain good results on the English data, which has only one known document per problem. However, in terms of runtime, our approach is generally more efficient, since all the necessary processing is performed in the training phase.

Table 4: Results on the test data set

Data set   AUC       C@1       finalScore   Runtime
English    0.4011    0.5       0.20055      00:05:46
Dutch      0.61306   0.62075   0.38056      00:02:03
Spanish    0.7238    0.67      0.48495      00:03:47

4 Conclusion

This year's authorship verification problem is considerably harder than last year's, since the number of known documents is very limited and the genre/topic of the known and unknown documents may differ significantly. In addition, for English, the data set was derived from Project Gutenberg opera and play scripts, which are an unusual type of text. We found that the most challenging part of this task was finding suitable features which could capture the differences between documents. Moreover, for certain data sets not all features were helpful, so applying feature selection was beneficial and greatly improved the accuracy of the classifier.

References

1. PAN Authorship Identification Task 2015, https://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/author-identification.html
2. Gunning, R.: The Technique of Clear Writing. McGraw-Hill (1952)
3. Guthrie, D.: Unsupervised Detection of Anomalous Text. Ph.D. thesis (2008)
4. Kincaid, J.P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Tech. Rep. (February 1975)
5. Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proceedings of the Twenty-first International Conference on Machine Learning (ICML '04), p. 62 (2004)
6. Koppel, M., Winter, Y.: Determining if two documents are written by the same author. Journal of the Association for Information Science and Technology 65(1), 178-187 (Jan 2014), http://doi.wiley.com/10.1002/asi.22954
7. Peñas, A., Rodrigo, A.: A simple measure to assess non-response. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1415-1424 (2011)