Authorship Attribution in Fan-Fictional Texts given Variable-Length Character and Word N-Grams
Notebook for PAN at CLEF 2019

Lukas Muttenthaler*,1[0000-0002-0804-4687], Gordon Lucas2[0000-0002-5626-6890], and Janek Amann[0000-0002-9868-9568]
University of Copenhagen (UCPH), 1 Department of Computer Science, 2 Department of Psychology
mnd926@alumni.ku.dk

Abstract
The task of authorship attribution (AA) requires text features to be represented according to rigorous experiments. In the current study, we aimed to develop three different n-gram models to identify the authors of various fan-fictional texts. Each of the three models was developed as a variable-length n-gram model. We implemented a standard character n-gram model (2–5 grams), a distorted character n-gram model (1–3 grams) and a word n-gram model (1–3 grams) to capture not only the syntactic features, but also the lexical features and content of a given text. Token weighting was performed through term frequency-inverse document frequency (tf-idf) computation. For each of the three models, we implemented a linear Support Vector Machine (SVM) classifier, and, in the end, applied a soft voting procedure to take the average of the classifiers' results. Results showed that, among the three individual models, the standard character n-gram model performed best. However, the combination of all three classifiers' predictions yielded the best results overall. To enhance computational efficiency, we computed dimensionality reduction using Singular Value Decomposition (SVD) before fitting the SVMs with training data. With a run time of approximately 180 seconds for all 20 problems, we achieved a macro F1-score of 70.5% on the development corpus and an F1-score of 69% on the competition's test corpus, which significantly outperformed the PAN 2019 baseline classifier. Thus, we have shown that it is not a single feature representation that yields accurate classifications, but rather the combination of various text representations that depicts an author's writing style most thoroughly.

Keywords: authorship attribution, n-grams, tf-idf, Support Vector Machine, Singular Value Decomposition

* Corresponding author.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Authorship Attribution (AA) is the task of determining the author of a text from a set of candidates. It can be a daunting exercise for both humans and machines if one does not know which parts of a document represent an author's writing style. However, if features are represented according to rigorous experiments (e.g., through the use of regular expressions and hyperparameter optimization) and adequately capture the syntactic use of language, they may well support automated systems which aim to recognize a text's author. In the context of machine learning, AA can be regarded as a multi-class, single-label text classification problem [6]. Its applications include plagiarism detection and forensic linguistics as well as research in literature [4,11]. For comprehensive surveys of this topic, see [5] and [12].

In this working notes paper, we describe our approach to the cross-domain AA task of the PAN 2019 competition, which comprised the identification of the authors of fan fiction [8]. Fan fiction describes literary works written by fans based on a previous, original literary work (also called the fandom).
Fan fiction usually includes the characters from the original; it does, however, change or reinterpret other parts of the story, such as settings or endings, or explore a less prominent character in more detail [7]. In recent years, fan fiction has generated some controversy concerning the violation of intellectual rights [8]. In this year's PAN competition, one of the tasks consisted of 20 author identification problems in English, French, Italian and Spanish (5 per language). Each problem featured 9 candidate authors, for each of whom 7 known texts were provided. The term cross-domain in this particular context refers to the fact that the texts with known authorship were from different fandoms, whereas the unknown texts were in a single and different fandom [8]. While PAN 2018 featured a closed set of candidate authors, this year's task presents an open-set problem: the real authors of some test cases are unknown.

2 Method

2.1 Feature Representation

Similarly to last year's best performing team [2], we deployed three different n-gram models to represent the fan-fictional texts. We implemented both character and word n-gram models. All models are variable-length n-gram models as, according to recent experiments, variable-length n-gram models both represent an author's style more adequately and yield higher precision and recall scores than fixed-length n-gram models, which do not capture the full scope of syntactic and lexical features [2,3,6].

The first model we implemented, a standard character n-gram model, consisted of bigram, trigram, four-gram and five-gram token representations (i.e., 2–5 grams). Hyperparameter tuning experiments revealed that 2–5 gram models yield the best results and represent an author's text more thoroughly than any other additive lower or higher n-gram text representations. However, for this standard character model we also computed an additional vector of unigram punctuation marks, which we then concatenated with all other character n-gram representations after pre-processing. We kept punctuation marks as we considered them important features of an author's use of language. The use, and in particular the frequency, of punctuation marks in a given text reveals crucial information about an author's writing style [1,4]. Furthermore, we converted each non-zero digit into a 0, since it is not the digit itself that represents a unique writing style, but rather how numbers are depicted in a given text (e.g., 1k vs. 1000; 1,000 vs. 1.000). The latter step was implemented for all three models. This particular model primarily captured syntactic and morphological features.
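To make the feature extraction concrete, the following is a minimal sketch of how the standard character n-gram representation could be built with scikit-learn, which we used throughout [9]. The vectorizer settings mirror Table 2 (sublinear tf, smoothed idf, L2 norm, no lower-casing); the helper names, the placeholder texts and the punctuation inventory are illustrative assumptions rather than the exact implementation.

```python
import re
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def mask_digits(text: str) -> str:
    """Replace every non-zero digit with a 0 placeholder, as described above."""
    return re.sub(r"[1-9]", "0", text)

# Variable-length character n-grams (2-5 grams), weighted with tf-idf.
char_vectorizer = TfidfVectorizer(
    analyzer="char",
    ngram_range=(2, 5),
    sublinear_tf=True,   # tf: sublinear (cf. Table 2)
    smooth_idf=True,     # idf: smoothed
    norm="l2",
    lowercase=False,     # casing is kept as a stylistic signal
)

# Additional unigram vector of punctuation marks, concatenated afterwards.
punct_vectorizer = TfidfVectorizer(
    analyzer="char",
    ngram_range=(1, 1),
    lowercase=False,
    vocabulary=list(".,;:!?'\"-()"),  # assumed punctuation inventory
)

known_texts = ["Some known text...", "Another known text!"]  # placeholder data
docs = [mask_digits(t) for t in known_texts]
features = hstack([char_vectorizer.fit_transform(docs),
                   punct_vectorizer.fit_transform(docs)])
```

In our setup, the resulting sparse matrix would then be reduced with SVD and passed to the linear SVM described in Section 2.3.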
The second model, which may be deemed a distorted character model, consisted of unigram, bigram and trigram token representations (i.e., 1–3 grams). This model particularly draws attention to punctuation marks, spacing, diacritics, and all other non-alphabetical characters, including numbers [2]. We replaced all non-diacritic characters with an asterisk symbol (*) to yield a uniform representation of dispensable characters and hence distort the text (see Table 1). We deployed this model to account for an author's orthography and punctuation, which are not necessarily dependent on other aspects of language use. We suspected that punctuation and orthography might well represent individual features of language. To provide an example, an author might have a preference for ellipses, or might leave out accents in Spanish or French texts.

Table 1: Showcase of text distortion. The text was extracted from problem set 2.

Original text:
"Yes," she says, "the needs of the many..."
"Yes," you agree. "It would be the worst thing in my life, but it would be worth it. Even if I weren't able to choose it, for whatever

Distorted text:
"***," *** ****, "*** ***** ** *** ****..."
"***," *** *****. "** ***** ** *** ***** ***** ** ** ****, *** ** ***** ** ***** **. **** ** * *****'* **** ** ****** **, *** ********

Lastly, we computed a variable-length word n-gram model, which consisted of unigram, bigram and trigram token representations (i.e., 1–3 grams). As a word n-gram model, contrary to a character n-gram model, rather captures the vocabulary and register than morphological and syntactic features, and thus draws attention to the content, we removed punctuation marks and special characters, as such tokens do not reveal decisive information about an author's choice of words. We only kept alphanumerical characters.

2.2 Text Preprocessing

To summarize, for all three n-gram models, we computed the following text pre-processing steps (an illustrative sketch follows the list):

– Each non-zero digit was represented by a 0 digit placeholder. This step was computed since different digits reveal crucial information neither about the vocabulary and register nor about the syntax of a given text. What is important is whether a digit or number appeared in a text and how frequently it was used. Furthermore, to gather information about syntactic features, it is indispensable to know how numerical characters were depicted by an author (see the section on the standard character n-gram model).
– Each hyperlink that appeared in a given text was replaced by an @ sign. We computed this step as we did not deem hyperlinks crucial features of an author's use of language. What is decisive is whether and how frequently hyperlinks were used within a given text, not the actual hyperlink that was cited.
– For each string, new-line and tab characters, denoted as "\n" and "\t", were replaced by the empty string.
– We did not perform text lower-casing, as we suppose that the usage (and in particular the corresponding frequency) of upper-case letters (e.g., nouns, proper nouns, entities, named entities) reveals important information about the writing style of an author.
– We did not remove stop words, as we believe that function words (e.g., "the", "a", "have"), which in the vast majority of documents are the most frequently used words, reveal decisive insights into an author's use of language. However, we still assign lower weights to terms that appear highly frequently across the entire corpus.
– Weighting for both character and word n-gram models was performed through term frequency-inverse document frequency (tf-idf) computation, to assign lower weights to words which appear frequently across the entire corpus and higher weights to words which are found only in specific documents and thus reveal crucial information about a text. To prevent a division by zero, we computed smooth inverse document frequency instead of normal idf. Smoothing idf means adding a 1 to the denominator within the logarithmic fraction: $\mathrm{idf}(t) = \log\left(\frac{N}{1 + n_t}\right)$, where $N$ is the number of documents and $n_t$ the number of documents containing term $t$.
– To decrease dimensionality, enhance computational efficiency and reduce time complexity, we computed Singular Value Decomposition (SVD) for each of the three variable-length n-gram models. The final number of dimensions chosen (d = 63) was dependent on the number of samples per problem. As each problem consisted of 63 texts and d cannot be higher than the number of samples in SVD, we deemed this an appropriate size for d to capture the most variance.
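As a rough illustration of the pre-processing and distortion steps listed above, the sketch below shows plain-Python helpers for digit masking, hyperlink replacement, new-line/tab removal, and asterisk-based text distortion. The exact regular expressions (in particular the URL pattern and the character class treated as "non-diacritic") are our assumptions, not the original implementation.

```python
import re

def preprocess(text: str) -> str:
    """Shared pre-processing applied to all three n-gram models."""
    text = re.sub(r"[1-9]", "0", text)               # non-zero digits -> 0 placeholder
    text = re.sub(r"https?://\S+", "@", text)        # hyperlinks -> @ (URL pattern assumed)
    text = text.replace("\n", "").replace("\t", "")  # drop new-line and tab characters
    return text

def distort(text: str) -> str:
    """Replace plain ASCII letters with '*' so that punctuation, spacing,
    diacritics and numbers remain visible (cf. Table 1); the exact character
    class is an assumption on our part."""
    return re.sub(r"[A-Za-z]", "*", text)

sample = 'Visit https://example.org!\nShe says, "the needs of the many..." 1000 times.'
print(preprocess(sample))
print(distort(preprocess(sample)))
```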
2.3 Models

For each of the three models (standard character, distorted character and word n-grams), we developed linear, multi-class Support Vector Machines (SVM). In doing so, we made use of Python's Sklearn library [9]. Each classifier was cross-validated three times. The results of the three individual SVMs were combined in a soft-voting manner: we simply averaged the probabilities for each candidate author across the three models. Initial experiments revealed that pure averaging yields better results than feeding the probabilities into a new classifier. Hence, we did not deploy a fourth model (see Figure 1). The soft voting computation was performed as follows:

$\arg\max \frac{\sum_{i=1}^{k} p_i}{k}$   (1)

where $p_i$ is the probability vector of each individual model and $k = 3$.

Figure 1: Illustration of the model architecture. After text pre-processing, features for the individual models were extracted. The three individual probability outputs are combined through a soft voting procedure into a final prediction.

The procedure was as follows:

– Firstly, the predicted probabilities obtained by each individual classifier were concatenated into an n × d matrix, where n is the number of classifiers (n = 3) and d denotes the number of classes / candidates.
– Secondly, for each text, we computed the average of the three 1 × d probability vectors, which, according to soft voting, yields a more accurate probability estimate than the individual probability distributions.
– Lastly, for each text, a "new" average probability distribution was computed, which served as the probability vector for the final prediction.

Our classifiers were required to consider that a test text may have been written by an unknown author. To enhance this algorithmic decision-making process, our models classified a text's author as unknown if the difference between the model's highest and second highest probabilities was below 0.1, or if the highest probability was below 0.25 (a minimal sketch of this decision rule follows Table 2). The latter served as an additional feature (compared to the PAN 2019 baseline model) that we regarded as crucial to include in our algorithm. The hyperparameters of the final analysis pipeline are summarized in Table 2. Our final model for the PAN 2019 shared task was deployed on a virtual machine using the TIRA architecture [10].

Table 2: Settings of the final model

term extraction           n-gram range                 std_char (2–5), dist_char (1–3), word (1–3)
feature extraction        tf                           sublinear
                          idf                          smoothed
                          norm                         L2
                          proportion of n-grams used   0.5
                          scaling                      MaxAbsScaler
dimensionality reduction  SVD                          63 components
classification            classifier                   SVM
                          decision procedure           soft vote
                          metric                       average
                          min difference               0.1
                          min probability              0.25
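The sketch below illustrates the soft voting and unknown-author decision rule described above. It assumes that per-class probabilities are already available from each of the three SVMs (e.g., via probability calibration, since a plain linear SVM does not output probabilities directly); the thresholds follow Table 2, while the function and label names are our own illustrative choices.

```python
import numpy as np

def soft_vote(probas, candidates, min_diff=0.1, min_proba=0.25):
    """Average the k class-probability vectors (Eq. 1) and return <UNK>
    when the decision is too uncertain, as described in Section 2.3.

    probas: array of shape (k, d), one probability vector per classifier.
    candidates: list of d candidate-author labels.
    """
    avg = np.mean(probas, axis=0)            # (1/k) * sum_i p_i
    top, second = np.sort(avg)[::-1][:2]     # two highest averaged probabilities
    if (top - second) < min_diff or top < min_proba:
        return "<UNK>"                       # open-set case: unknown author (label assumed)
    return candidates[int(np.argmax(avg))]

# Toy example with k = 3 classifiers and d = 3 candidate authors.
candidates = ["author_a", "author_b", "author_c"]
probas = np.array([[0.6, 0.3, 0.1],
                   [0.5, 0.4, 0.1],
                   [0.7, 0.2, 0.1]])
print(soft_vote(probas, candidates))  # -> "author_a"
```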
3 Results

Table 3 summarizes the results obtained for the development corpus. We compared performances between the official PAN 2019 baseline SVM classifier, our three individual classifiers and the soft-vote average model. The highest scores for each problem are displayed in bold face.

Table 3: Comparison of macro F1 scores for our different models on the PAN 2019 AA development corpus

Problem  Language  No. test texts  baseline  char   dist   word   soft vote
01       en        561             0.695     0.741  0.742  0.631  0.857
02       en        137             0.447     0.552  0.423  0.455  0.553
03       en        211             0.491     0.620  0.579  0.489  0.738
04       en        273             0.331     0.417  0.238  0.299  0.537
05       en        264             0.473     0.481  0.417  0.475  0.585
06       fr        121             0.702     0.711  0.655  0.437  0.777
07       fr        92              0.499     0.551  0.427  0.469  0.588
08       fr        430             0.506     0.569  0.474  0.411  0.673
09       fr        239             0.599     0.656  0.636  0.437  0.723
10       fr        38              0.442     0.544  0.481  0.303  0.658
11       it        139             0.651     0.708  0.662  0.505  0.780
12       it        116             0.594     0.685  0.527  0.584  0.658
13       it        196             0.687     0.762  0.572  0.625  0.786
14       it        46              0.583     0.680  0.725  0.464  0.750
15       it        54              0.745     0.778  0.451  0.654  0.785
16       sp        164             0.768     0.826  0.536  0.705  0.843
17       sp        112             0.584     0.653  0.497  0.634  0.723
18       sp        238             0.704     0.803  0.706  0.610  0.823
19       sp        450             0.556     0.635  0.441  0.505  0.682
20       sp        170             0.511     0.530  0.141  0.294  0.479
mean     all       203             0.578     0.645  0.516  0.499  0.705

No changes were made to the provided baseline classifier; as such, it utilized only character trigrams, a minimum document frequency of 5 (character trigrams that appeared fewer than five times across the corpus were not included in the vocabulary), no text lower-casing, no weighting of the bag of words (only a count-based approach), and a one-vs-rest multi-class strategy. It classified documents as unknown if and only if the highest and second highest predictions were less than 0.1 apart.

Results showed that the standard character n-gram model generally performed better than the variable-length word n-gram or distorted character n-gram model. However, in the vast majority of runs, the soft voting classifier resulted in a higher score than any of the individual models (see Table 3). According to the mean scores obtained for the three individual models, the word n-gram model showed the worst performance. This is in line with the assumption that authorship manifests itself in style, and thus in a text's syntax and morphology, rather than in a text's vocabulary. Word n-grams primarily encode lexical information about a document, whereas standard character n-grams and distorted character n-grams rather capture the syntactic and morphological characteristics of the text, which further reveal information about an author's writing style.

The SVD implementation reduced our algorithm's computational time from approx. 30 minutes to just 180 seconds. In the final evaluation, we were unfortunately unable to deploy SVD due to time constraints, which is why the run time for our PAN 2019 model is 30 minutes and not 180 seconds. The macro F1-score (approx. 70%), however, did not change as a result of dimensionality reduction.

4 Discussion

The results showed that the soft voting procedure notably outperformed the individual classifiers. Intuitively, one might assume that averaging over classifiers with lower macro F1-scores would yield a worse rather than a better performance. However, we did not perform averaging across the F1-scores, but across the predicted probabilities $\hat{y}_i$ obtained for each classifier. This served as an additive factor, as all three feature representations were combined into one thorough representation. One limitation of our approach might have been that we did not apply a weighted voting procedure. This could and should be addressed in future research. Moreover, it might be interesting to consider a hard instead of a soft voting procedure.
We further encourage others to deploy different machine learning classifiers, such as Multinomial Logistic Regression, Multinomial Naive Bayes or Neural Networks. Additionally, further experiments might compute Principal Component Analysis (PCA) instead of Singular Value Decomposition (SVD) to reduce dimensionality and enhance computational efficiency.

5 Conclusion

This paper presented our soft voting ensemble classifier for cross-domain authorship attribution. We combined a standard character n-gram model, a distorted character n-gram model and a word n-gram model to achieve more accurate predictions than the individual models themselves. All three models represented the texts as variable-length n-grams, which were weighted by term frequency-inverse document frequency (tf-idf). Our algorithm can be perceived as an enhancement of the PAN 2019 baseline system. One may infer from the results we obtained that authorship attribution models generally benefit from the inclusion of different representations of text. It is not a single feature representation that yields accurate classifications, but rather the combination of various document representations that depicts an author's writing style.

References

1. Can, M.: Authorship attribution using principal component analysis and competitive neural networks. Mathematical and Computational Applications 19(1), 21–36 (2014). https://doi.org/10.3390/mca19010021
2. Custódio, J.E., Paraboni, I.: EACH-USP ensemble cross-domain authorship attribution. In: Proceedings of the Ninth International Conference of the CLEF Association (2018)
3. Daneshvar, S., Inkpen, D.: Gender Identification in Twitter using N-grams and LSA: Notebook for PAN at CLEF 2018. In: Proceedings of the Ninth International Conference of the CLEF Association (2018)
4. Houvardas, J., Stamatatos, E.: N-Gram Feature Selection for Authorship Identification. In: International Conference on Artificial Intelligence: Methodology, Systems, and Applications. pp. 77–86. Springer, Berlin, Heidelberg (2006). https://doi.org/10.1007/11861461_10
5. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology 60(1), 9–26 (2009). https://doi.org/10.1002/asi.20961
6. Markov, I., Stamatatos, E., Sidorov, G.: Improving cross-topic authorship attribution: The role of pre-processing. In: International Conference on Computational Linguistics and Intelligent Text Processing. pp. 289–302. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-77116-8_21
7. Milli, S., Bamman, D.: Beyond Canonical Texts: A Computational Analysis of Fanfiction. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 2048–2053 (2016). https://doi.org/10.18653/v1/d16-1218
8. PAN: PAN @ CLEF 2019: Cross-domain Authorship Attribution (2019), https://pan.webis.de/clef19/pan19-web/author-identification.html
9. Pedregosa, F., Varoquaux, G., Gramfort, A., Thirion, B., Michel, V., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Vanderplas, J., Cournapeau, D., Dubourg, V., Passos, A., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
10. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
11. Sapkota, U., Bethard, S., Montes, M., Solorio, T.: Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 93–102. Association for Computational Linguistics (ACL) (2015). https://doi.org/10.3115/v1/n15-1010
12. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009). https://doi.org/10.1002/asi.21001