=Paper=
{{Paper
|id=Vol-2936/paper-197
|storemode=property
|title=Feature Vector Difference based Authorship Verification for Open-World Settings
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-197.pdf
|volume=Vol-2936
|authors=Janith Weerasinghe,Rhia Singh,Rachel Greenstadt
|dblpUrl=https://dblp.org/rec/conf/clef/WeerasingheSG21
}}
==Feature Vector Difference based Authorship Verification for Open-World Settings==
Feature Vector Difference based Authorship Verification for Open-World Settings
Notebook for PAN at CLEF 2021

Janith Weerasinghe¹, Rhia Singh², Rachel Greenstadt¹
¹ New York University, 6 MetroTech Center, Brooklyn, NY 11201, United States of America
² Macaulay Honors College (Hunter CUNY), 695 Park Avenue, New York, NY 10065, United States of America
janith@nyu.edu (J. Weerasinghe); rhia.singh@macaulay.cuny.edu (R. Singh); greenstadt@nyu.edu (R. Greenstadt)
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract

This paper describes the approach we took to create a machine learning model for the PAN 2021 Authorship Verification Task. The goal of this task is to predict if a given pair of documents was written by the same author. For each document pair, we extracted stylometric features from the documents and used the absolute difference between the feature vectors as input to our classifier. Our new model is similar to our last year's model, with minor improvements to the feature set and the classifier. We trained two models, on the small and the large dataset respectively, which achieved AUCs of 0.967 and 0.972 in the final evaluations.

Keywords: Authorship Verification, Stylometry, Machine Learning, Natural Language Processing

1. Introduction

This paper presents our approach for the Authorship Verification Shared Task [1] at PAN 2021 [2]. The objective of this task was to create a model that would be able to predict if two given documents were written by the same person. This year's shared task used the same training data as last year, but is more challenging because the “test set [is] made entirely of unseen authors and topics” (https://pan.webis.de/clef21/pan21-web/author-identification.html). This requires our model to be topic agnostic and to work robustly in an open-world setting. Our new model follows our approach [3] from the PAN 2020 authorship verification task [4], with improvements made to address these new challenges.

The dataset provided for this task was compiled by Bischoff et al. [5] and contains English documents from fanfiction.net. Each record in the dataset consists of two documents, which may or may not be written by the same person, and the fandom that each document was categorized under. The ground truth specifies the author identifiers for each document and the prediction target indicating whether the two documents were written by the same person. The training dataset for the shared task was available in two sizes: a smaller dataset with 52,590 records and a larger dataset with 275,486 records, with each document containing on average about 21,000 characters and 4,800 tokens.

2. Approach

The goal of the PAN 2021 authorship verification shared task [2] was to predict if two given documents (𝐷𝑖 and 𝐷𝑗) were written by the same person. We modeled this as a binary classification problem, in which the input to our classifier is a feature vector encoding the two documents (𝑋) and the target variable (𝑌) indicates whether or not the two documents were written by the same author. We trained two models (one on the smaller dataset and one on the larger dataset) using an identical approach for both.
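As a minimal illustration of this formulation (a sketch with made-up numbers, not the actual pipeline), the pair representation is simply the element-wise absolute difference of the two documents' feature vectors, which is described further in Section 2.3:

```python
# Minimal sketch of the pair representation used throughout this paper
# (illustrative only; feature extraction itself is described in Section 2.3).
import numpy as np

def pair_vector(x_i: np.ndarray, x_j: np.ndarray) -> np.ndarray:
    """Encode a document pair as the absolute difference of its feature vectors."""
    return np.abs(x_i - x_j)

# Toy example with made-up 4-dimensional feature vectors for documents D_i and D_j.
x_i = np.array([0.8, 0.1, 0.0, 0.5])
x_j = np.array([0.6, 0.4, 0.0, 0.9])
X = pair_vector(x_i, x_j)   # -> [0.2, 0.3, 0.0, 0.4]; Y would be 1 for a same-author pair
```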
Our approach was implemented in Python, primarily using the NLTK [6] and Scikit-learn [7] libraries, and the source code is available at: https://github.com/janithnw/pan2021_authorship_verification

2.1. Preparing the datasets

We divided the provided small and large datasets into training and testing sets, with the training sets containing roughly 75% of the total records. Since the model needs to work in an open-world setting, we needed to ensure that the authors included in the training set did not appear in the test set. To perform this split, we started by iterating through all the authors in the dataset. For each author, we randomly decided whether they would be placed in the training set (with a 75% chance) or the testing set (with a 25% chance). Then, for each selected author, we also found all the other authors that appear with them in the dataset and included those authors and their associated records in the same partition as the original author. Note that we do this search recursively, so that once a single author is included in a partition, all the authors that they appear with, and all the other authors that the newly selected authors appear with, are included in the same training or test dataset. However, we did not take the topics (or fandoms) into account when doing these splits. Therefore, it is possible that both the training and testing sets contain fan-fictions written about the same fandom.
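The sketch below illustrates one way to implement this author-disjoint split, assuming each record is reduced to a pair of author identifiers. The recursive search described above amounts to assigning whole connected components of the author co-occurrence graph to a single partition; the function and record format here are illustrative and not taken from the released code.

```python
# Sketch of an author-disjoint train/test split over (author_a, author_b) records.
# Authors that ever co-occur end up in the same partition, matching the
# recursive search described above.
import random
from collections import defaultdict

def author_disjoint_split(pairs, train_prob=0.75, seed=0):
    """Split records so that no author appears in both partitions."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)           # author co-occurrence graph
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    assignment = {}                        # author id -> "train" or "test"
    for author in neighbors:
        if author in assignment:
            continue
        part = "train" if rng.random() < train_prob else "test"
        stack = [author]                   # flood-fill the connected component
        while stack:
            current = stack.pop()
            if current in assignment:
                continue
            assignment[current] = part
            stack.extend(n for n in neighbors[current] if n not in assignment)

    train = [p for p in pairs if assignment[p[0]] == "train"]
    test = [p for p in pairs if assignment[p[0]] == "test"]
    return train, test

# Example: authors 1, 2, 3 form one component, authors 4 and 5 another.
records = [(1, 2), (2, 3), (1, 1), (4, 5)]
train_records, test_records = author_disjoint_split(records)
```

Because both authors of every record land in the same component, no author (and no author they co-occur with) can appear in both partitions.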
2.2. Preprocessing

We ran each document in the dataset through a series of pre-processing steps before feature extraction. The outputs of the preprocessing steps are stored together with the document, which is passed to the feature extraction step in our pipeline. We will use the following sentence as a running example in this section: “The Soviets had already been merciless, ruthless as the next army.”

Tokenizer: We used NLTK's casual_tokenize method, which uses the TweetTokenizer to tokenize the documents. Our initial observations found this method to perform better at handling punctuation marks and words than the default Treebank Word Tokenizer. The tokenized version of the document is stored to be used in the next pre-processing steps and in the feature extraction steps.

Part-of-Speech (POS) Tagging: We used NLTK's Perceptron Tagger to perform part-of-speech tagging. The POS tags are stored together with the document and are used in the next preprocessing steps and in feature extraction. The following would be the output of our POS tagger for the example sentence above:

[('The', 'DT'), ('Soviets', 'NNPS'), ('had', 'VBD'), ('already', 'RB'), ('been', 'VBN'), ('merciless', 'RB'), (',', ','), ('ruthless', 'NN'), ('as', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('army', 'NN'), ('.', '.')]

Generating a Partial Parse Tree (POS Tag Chunking): We trained a Maxent (Maximum Entropy) classifier on the CoNLL 2000 corpus [8] to do POS tag chunking, following the example provided by Bird et al. [9] in their NLTK book (Chapter 7). The following would be the output of our parser for the example sentence above:

(S (NP The/DT Soviets/NNPS) (VP had/VBD already/RB been/VBN) (NP merciless/NN) ,/, (NP ruthless/NN) (PP as/IN) (NP the/DT next/JJ army/NN))

2.3. Features

This section lists the features that we extracted from the preprocessed data. Most of these features were used in our previous work [3] and are commonly used in prior stylometry work [10]. We also used some features that are described in the Writeprints feature set [11].

We also believed that the syntactic structure of sentences would provide valuable signals to the classifier. Following prior work [12, 13], we included POS-tag n-grams and partial parses (or POS-tag chunks) as part of our feature set. The use of parse trees to extract stylometric features, called syntactic dependency-based n-grams of POS tags, was introduced by Sidorov et al. [14]. We used a slightly different approach to encode parse tree features (described below), capturing how different noun and verb phrases are constructed.

Several of our features described below are computed in terms of TF-IDF values. We used Scikit-learn's TfidfVectorizer to compute the TF-IDF vectors for the documents. We set the min_df parameter to 0.1 to ignore tokens that have a document frequency of less than 10%. Features that are new or changed in this year's model are denoted with an asterisk (*).

• Character n-grams*: TF-IDF values for character n-grams, where 1 ≤ 𝑛 ≤ 3. In last year's model, we included up to character 6-grams. We believe this resulted in our model being slightly affected by topic similarities [15]. To avoid this bias, our current model only includes character n-grams up to tri-grams.

• POS-tag n-grams: TF-IDF values of POS-tag tri-grams.

• Special characters: TF-IDF values for 31 pre-defined special characters.

• Frequency of function words*: Frequencies of 851 common English words (list downloaded from https://countwordsfree.com/stopwords).

• Average number of characters per word: The average number of characters per token.

• Distribution of word lengths (1-10): The fraction of tokens of length 𝑙, where 1 ≤ 𝑙 ≤ 10.

• Vocabulary richness*: In this year's model, we included several measures of vocabulary richness. The first is the ratio of hapax legomena to dis legomena, which was included in last year's model. Here, hapax legomena is the number of words that occur only once in the document and dis legomena is the number of words that occur twice. In addition, we included the following measures: type-token ratio, Guiraud's R [16], Herdan's C [17, 18], Dugast's k and U [19], Maas' 𝑎² [20], Tuldava's LN [21], Brunet's W [22], Carroll's CTTR [23], Summer's S, Sichel's S [24], Michéa's M [25], Honoré's H [26], Herdan's 𝑉𝑚 [27], entropy, Yule's K [28], and Simpson's D [29]. We used the implementations of these measures in the Python textcomplexity package (https://github.com/tsproisl/textcomplexity).

• POS-tag chunks: TF-IDF values for tri-grams of POS-tag chunks. Here, we consider the tokens at the second level of our parse tree. For example, for the sentence above, the input to our vectorizer would be ['NP', 'VP', 'NP', ',', 'NP', 'IN', 'NP', '.'].

• POS chunk construction: TF-IDF values of each noun phrase, verb phrase, and prepositional phrase expansion. For the sentence above, these expansions are ['NP[DT NNPS]', 'VP[VBD RB VBN]', 'NP[NN]', 'NP[NN]', 'NP[DT JJ NN]'].

• Stop-word and POS-tag hybrid tri-grams*: To capture stylistic information about word order while also preventing topic-related biases, we replaced all words other than the function words with their part-of-speech tag and computed the TF-IDF values of the tri-grams from this modified text (a code sketch of this distortion follows this list). Similar methods of text distortion have been used successfully in previous studies [30, 31].

• Part-of-speech tag ratios*: Following the work of Castro-Castro et al. [32], who computed the ratio of nouns and adjectives, we calculated the proportion of all part-of-speech tags in the Penn Treebank POS tag collection, in an attempt to better capture the syntactic composition of the text.

• Unique spellings*: The fraction of words in the document that belong to each of the following dictionaries: commonly misspelled English words (https://www.mentalfloss.com/article/629813/100-commonly-misspelled-words-english), common typos when communicating online (https://www.lexico.com/grammar/common-misspellings), common errors with determiners (https://www.ef.edu/english-resources/english-grammar/determiners/), British spellings of words (https://www.lexico.com/grammar/british-and-spelling), and popular online abbreviations (https://preply.com/en/blog/2020/05/07/the-most-used-internet-abbreviations-for-texting-and-tweeting, https://englishstudyhere.com/abbreviations-contractions/50-common-internet-abbreviations/).
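As an illustration of two of the TF-IDF-based feature families above, the sketch below configures a character n-gram vectorizer with min_df=0.1 and applies the stop-word/POS-tag distortion to tagged tokens. It is a simplified sketch rather than the released implementation; in particular, FUNCTION_WORDS stands in for the 851-word list and is not the actual list.

```python
# Simplified sketch of two of the TF-IDF feature families described above.
# Assumes documents are plain strings and that tagged tokens are available
# as (token, POS tag) pairs from the preprocessing step.
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams (n = 1..3); min_df=0.1 drops n-grams seen in <10% of documents.
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3), min_df=0.1)

# Stand-in for the 851 common English words used as function words (not the real list).
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in", "as", "had", "been", "already"}

def distort(tagged_tokens):
    """Stop-word / POS-tag hybrid text: keep function words, replace every
    other token with its POS tag, so tri-grams capture word order, not topic."""
    return " ".join(tok.lower() if tok.lower() in FUNCTION_WORDS else tag
                    for tok, tag in tagged_tokens)

# lowercase=False keeps the upper-case POS tags distinct from function words.
hybrid_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(3, 3),
                                    token_pattern=r"\S+", lowercase=False, min_df=0.1)

# Tiny illustrative corpus: raw text for the character model,
# (token, POS tag) pairs from the preprocessing step for the hybrid model.
docs = ["The Soviets had already been merciless, ruthless as the next army."]
tagged = [[("The", "DT"), ("Soviets", "NNPS"), ("had", "VBD"), ("already", "RB"),
           ("been", "VBN"), ("merciless", "RB"), (",", ","), ("ruthless", "NN"),
           ("as", "IN"), ("the", "DT"), ("next", "JJ"), ("army", "NN"), (".", ".")]]

char_features = char_vectorizer.fit_transform(docs)
hybrid_features = hybrid_vectorizer.fit_transform(distort(t) for t in tagged)
```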
We fit our feature extractors on the training sets. We also standardize features by removing the mean and scaling to unit variance. Then, we take the absolute vector difference between the feature vectors corresponding to each document pair. We then apply a secondary scaling step to ensure that the vector differences are standardized as well. This step was necessary for the stochastic gradient descent algorithm that we used to train our logistic regression classifier. More formally, given documents 𝐷𝑖 and 𝐷𝑗, we represent their scaled feature vectors as 𝑋𝑖 and 𝑋𝑗. We then compute the vector difference as 𝑋 = |𝑋𝑖 − 𝑋𝑗|. The input to our classifier is the scaled version of 𝑋.

2.4. Classifier

We computed the features for each document pair in the two datasets (smaller and larger) as described in the previous section. Our previous experience showed that a logistic regression classifier worked best with our approach. Since the complete feature matrix cannot be stored in memory, we used a stochastic gradient descent training algorithm with a logarithmic loss function, which results in a logistic regression classifier. We used Scikit-learn's SGDClassifier implementation. We found the best value for the alpha parameter using RandomizedSearchCV, running the search on a sample of training records. We ran the SGDClassifier for 50 iterations.
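A minimal sketch of this classifier setup is shown below, assuming the scaled difference vectors are already available as a feature matrix. The toy data, the alpha search range, and the number of search iterations are placeholders, not the values used for the submitted models.

```python
# Sketch of the classifier setup described above (illustrative, not the exact
# configuration used for the submitted models). X holds scaled |X_i - X_j|
# vectors and y holds the same-author labels.
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Tune alpha (the regularization strength) on a sample of the training records.
search = RandomizedSearchCV(
    SGDClassifier(loss="log_loss", max_iter=50),  # use loss="log" on scikit-learn < 1.1
    param_distributions={"alpha": loguniform(1e-7, 1e-2)},  # assumed search range
    n_iter=20,
    scoring="roc_auc",
    random_state=0,
)

# Toy data standing in for the real (much larger) feature matrix.
rng = np.random.default_rng(0)
X_sample = np.abs(rng.normal(size=(200, 10)))
y_sample = rng.integers(0, 2, size=200)
search.fit(X_sample, y_sample)

clf = SGDClassifier(loss="log_loss", alpha=search.best_params_["alpha"], max_iter=50)
clf.fit(X_sample, y_sample)                          # full training set in practice
probabilities = clf.predict_proba(X_sample)[:, 1]    # same-author probability
```

Because the full feature matrix does not fit in memory, the real training loop would feed the data in batches (for example via SGDClassifier's partial_fit) rather than a single fit call as in this toy example.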
3. Results

Table 1 shows the results of our two models under different test datasets and settings. The models are evaluated on five measures: area under the ROC curve (AUC), F1-score, c@1 (a variant of the F1-score which rewards systems that leave difficult problems unanswered [33]), F0.5u (a measure that puts more emphasis on deciding same-author cases correctly [34]), and the complement of the Brier score [35]. We submitted our smaller model during the early submission phase; those results were obtained before we incorporated the new vocabulary richness measures, the stop-word and POS-tag hybrid tri-gram features, the POS tag ratios, and the unique spellings into our feature set. Once the final models were trained, we deployed them to the TIRA evaluation system [36] provided by the PAN 2021 organizers, where the models were evaluated on an unseen dataset.

Table 1: Results from our local evaluations, early submission, and the final evaluations

Description                       AUC    c@1    F0.5u  F1-score  Brier
Small dataset, local test set     0.965  0.903  0.928  0.903     0.925
Small dataset, early submission   0.955  0.890  0.894  0.889     0.919
Small dataset, final evaluation   0.967  0.910  0.907  0.927     0.929
Large dataset, local test set     0.967  0.909  0.918  0.915     0.928
Large dataset, final evaluation   0.972  0.917  0.916  0.926     0.934

4. Discussion and Conclusion

In this paper we presented the approach behind an authorship verification model that works robustly in an open-world setting under varying topics. Our approach is an improvement over our earlier model, which was submitted to the PAN 2020 Authorship Verification task. Most of the improvements come from incorporating new features.

We would also like to discuss other ideas that we attempted but that were not successful. We attempted to split each document into several smaller documents and train a model on a larger number of smaller documents. We hoped that by doing so we would be able to simulate having multiple documents per author and therefore be able to make multiple comparisons across the two authors. We expected the aggregated results of multiple comparisons to yield better performance, or to serve as another measure of classifier confidence which we could then use to leave out low-confidence predictions. Our attempts to use this strategy did not result in better performance. We also attempted to encode features extracted from dependency parses. The performance gain after incorporating these features was minimal, and we ended up not including them due to the significant computing power required for dependency parsing.

Our work shows that, by selecting features that are less likely to encode topic information, it is possible to train a topic-agnostic authorship verification model that works well in open-world settings.

5. Acknowledgements

We thank the PAN 2021 organizers for organizing the shared task and helping us through the submission process. We also thank the reviewers for their helpful comments and feedback. Our work was supported by the National Science Foundation under grant 1931005 and the McNulty Foundation.

References

[1] M. Kestemont, I. Markov, E. Stamatatos, E. Manjavacas, J. Bevendorff, M. Potthast, B. Stein, Overview of the Authorship Verification Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[2] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.
[3] J. Weerasinghe, R. Greenstadt, Feature Vector Difference based Neural Network and Logistic Regression Models for Authorship Verification—Notebook for PAN at CLEF 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/.
[4] M. Kestemont, E. Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020.
[5] S. Bischoff, N. Deckers, M. Schliebs, B. Thies, M. Hagen, E. Stamatatos, B. Stein, M. Potthast, The Importance of Suppressing Domain Style in Authorship Analysis, CoRR abs/2005.14714 (2020). URL: https://arxiv.org/abs/2005.14714.
[6] S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[8] E. F. Tjong Kim Sang, S. Buchholz, Introduction to the CoNLL-2000 shared task: Chunking, in: C. Cardie, W. Daelemans, C. Nedellec, E. Tjong Kim Sang (Eds.), Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000, pp. 127–132.
[9] S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.
[10] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology 60 (2009) 538–556.
[11] A. Abbasi, H. Chen, Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace, ACM Transactions on Information Systems 26 (2008) 1–29. doi:10.1145/1344411.1344413.
[12] G. Hirst, O. Feiguina, Bigrams of syntactic labels for authorship discrimination of short texts, Literary and Linguistic Computing 22 (2007) 405–417.
[13] K. Luyckx, W. Daelemans, Shallow text analysis and machine learning for authorship attribution, LOT Occasional Series 4 (2005) 149–160.
[14] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, L. Chanona-Hernández, Syntactic n-grams as machine learning features for natural language processing, Expert Systems with Applications 41 (2014) 853–860. URL: http://www.sciencedirect.com/science/article/pii/S0957417413006271. doi:10.1016/j.eswa.2013.08.015.
[15] M. Kestemont, E. Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/.
[16] P. Guiraud, Les caractères statistiques du vocabulaire: essai de méthodologie, Presses universitaires de France, 1954.
[17] G. Herdan, Type-token mathematics, volume 4, Mouton, 1960.
[18] G. Herdan, Quantitative linguistics, London: Butterworths, 1964.
[19] D. Dugast, Vocabulaire et stylistique, volume 8, Slatkine, 1979.
[20] H.-D. Maas, Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes [Relationship between vocabulary and text length], Zeitschrift für Literaturwissenschaft und Linguistik 2 (1972) 73.
[21] J. Tuldava, Quantitative relations between the size of the text and lexical richness, Journal of Linguistic Calculus (1977) 28–35.
[22] É. Brunet, Le vocabulaire de Jean Giraudoux, structure et évolution, volume 1, Slatkine, 1978.
[23] J. B. Carroll, Language and thought, Reading Improvement 2 (1964) 80.
[24] H. S. Sichel, On a distribution law for word frequencies, Journal of the American Statistical Association 70 (1975) 542–547.
[25] R. Michéa, Répétition et variété dans l'emploi des mots, Bulletin de la Société de Linguistique de Paris (1969) 1–24.
[26] A. Honoré, Some simple measures of richness of vocabulary, Association for Literary and Linguistic Computing Bulletin 7 (1979) 172–177.
[27] G. Herdan, A new derivation and interpretation of Yule's 'characteristic' K, Zeitschrift für angewandte Mathematik und Physik ZAMP 6 (1955) 332–339.
[28] G. U. Yule, The statistical study of literary vocabulary, Cambridge University Press, 2014.
[29] E. H. Simpson, Measurement of diversity, Nature 163 (1949) 688–688.
[30] S. Bergsma, M. Post, D. Yarowsky, Stylometric analysis of scientific articles, in: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pp. 327–337.
[31] E. Stamatatos, Authorship attribution using text distortion, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 1138–1149.
[32] D. Castro-Castro, C. Rodríguez-Losada, R. Muñoz, Mixed Style Feature Representation and B0-maximal Clustering for Style Change Detection—Notebook for PAN at CLEF 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/.
[33] A. Peñas, Á. Rodrigo, A simple measure to assess non-response, in: ACL, 2011, pp. 1415–1424. URL: http://www.aclweb.org/anthology/P11-1142.
[34] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 654–659. URL: https://www.aclweb.org/anthology/N19-1068. doi:10.18653/v1/N19-1068.
[35] G. W. Brier, Verification of forecasts expressed in terms of probability, Monthly Weather Review 78 (1950) 1–3.
[36] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.