Datasets and Models for Authorship Attribution on Italian Personal Writings

Gaetana Ruggiero, Albert Gatt, Malvina Nissim
Institute of Linguistics and Language Technology, University of Malta, Malta
Center for Language and Cognition, University of Groningen, The Netherlands
garuggiero@gmail.com, albert.gatt@um.edu.mt, m.nissim@rug.nl

Abstract

Existing research on Authorship Attribution (AA) focuses on texts for which a lot of data is available (e.g. novels), mainly in English. We approach AA via Authorship Verification on short Italian texts in two novel datasets, and analyze the interaction between genre, topic, gender and length. Results show that AV is feasible even with little data, but more evidence helps. Gender and topic can be indicative clues, and if not controlled for, they might overtake more specific aspects of personal style.

1 Introduction and Background

Authorship Attribution (AA) is the task of identifying authors by their writing style. In addition to being a tool for studying individual language choices, AA is useful for many real-life applications, such as plagiarism detection (Stamatatos and Koppel, 2011), multiple accounts detection (Tsikerdekis and Zeadally, 2014), and online security (Yang and Chow, 2014).

Most work on AA focuses on English, on relatively long texts such as novels and articles (Juola, 2015), where personal style could be mitigated due to editorial interventions. Furthermore, in many real-world applications the texts of disputed authorship tend to be short (Omar et al., 2019).

The PAN 2020 shared task was originally meant to investigate multilingual AV in fanfiction, focusing on Italian, Spanish, Dutch and English (Bevendorff et al., 2020). However, the datasets were eventually restricted to English only, to maximize the amount of available training data (Kestemont et al., 2020), emphasizing the difficulty in compiling large enough datasets for less-resourced languages.

AA research in Italian has largely focused on the single case of Elena Ferrante (Tuzzi and Cortelazzo, 2018).[1] The present work seeks a more realistic take, using more diverse, user-generated data, namely web forum comments and diary fragments, thereby introducing two novel datasets for this task: ForumFree and Diaries.

We cast the AA problem as authorship verification (AV). Rather than identifying the specific author of a text (the most common task in AA), AV aims at determining whether two texts were written by the same author or not (Koppel and Schler, 2004; Koppel et al., 2009).

The GLAD system of Hürlimann et al. (2015) was specifically developed to solve AV problems, and has been shown to be highly adaptable to new datasets (Halvani et al., 2018). GLAD uses an SVM with a variety of features, including character-level ones, which have proved to be most effective for AA tasks (Stamatatos, 2009; Moreau et al., 2015; Hürlimann et al., 2015), and is freely available. Moreover, Kestemont et al. (2019) show that many of the best models for authorship attribution are based on Support Vector Machines. Hence we adopt GLAD in the present study.

More specifically, we run GLAD on our datasets and study the interaction of four different dimensions: topic, gender, amount of evidence per author, and genre. In practice, we design intra-topic, cross-topic, and cross-genre experiments, controlling for gender and amount of evidence per author. The focus on cross-topic and cross-genre AV is in line with the PAN 2015 shared task (Stamatatos et al., 2015); this setting has been shown to be more challenging than the task definitions of previous editions (Juola and Stamatatos, 2013; Stamatatos et al., 2014).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] https://www.newyorker.com/culture/cultural-comment/the-unmasking-of-elena-ferrante
Contributions We advance AA for Italian by introducing two novel datasets, ForumFree and Diaries, which enhance the amount of available Italian data suitable for AA tasks.[2] Running a battery of experiments on personal writings, we show that AV is feasible even with little data, but more evidence helps. Gender and topic can be indicative clues, and if not controlled for, they might overtake more specific aspects of personal style.

2 Data

For the present study, we introduce two novel datasets, ForumFree and Diaries. Although already compiled (Maslennikova et al., 2019), the original ForumFree dataset was not meant for AA. Therefore, we reformat it following the PAN format.[3] The dataset contains web forum comments taken from the ForumFree platform,[4] and the subset used in this work covers two topics, Medicina Estetica ("Aesthetic Medicine") and Programmi Tv ("Tv Programmes"; Celebrities in the original dataset). A third subset, Mix, is the union of the first two. The Diaries dataset is originally assembled for the present study, and contains a collection of diary fragments included in the project Italiani all'estero: i diari raccontano ("Italians abroad: the diaries narrate").[5] For Diaries, no topic classification has been taken into account. Table 1 shows an overview of the datasets.

Subset     F     M     Tot    # Docs    W/A    D/A    W/D
Med Est    33    44    77     56198     63     661    48
Prog TV    78    71    149    153019    32     812    22
Mix        111   115   276    209217    41     791    29
Diaries    77    188   275    1422      462    5      477

Table 1: Overview of the datasets. F/M/Tot = female/male/total authors; W/A = Avg words per author; D/A = Avg docs per author; W/D = Avg words per doc.

2.1 Preprocessing

For the ForumFree dataset, comments which only contained the word up, commonly used on the internet to give new visibility to a post that was written in the past, were removed from the dataset, together with their authors when this was the only text associated with them.

The stories narrated in the diaries are of a very personal nature, which means that many proper nouns and names of locations are used. To avoid relying on these explicit clues, which are strong but not indicative of personal writing style, we perform Named Entity Recognition (NER), using spaCy (Honnibal, 2015). Person names, locations and organizations were replaced by their corresponding labels, namely PER, LOC, ORG. The fourth label used by spaCy, MISC (miscellany), was not considered; dates were also not normalized.

Moreover, a separate set of experiments was performed by bleaching the diary texts prior to their input to the GLAD system. The bleaching method was proposed by van der Goot et al. (2018) in the context of cross-lingual Gender Prediction, and consists of transforming tokens into an abstract representation that masks lexical forms while maintaining key features. We only use 4 of the 6 original features. Shape transforms uppercase letters into 'U', lowercase ones into 'L', digits into 'D', and the rest into 'X'. PunctA replaces emojis with 'J', emoticons with 'E', punctuation with 'P' and one or more alphanumeric characters with a single 'W'. Length represents a word by the number of its characters. Frequency corresponds to the log frequency of a token in the dataset. The features are then concatenated: the word 'House' would be rewritten as 'ULLLL W 05 6'.

[2] Further information about the datasets can be found at https://github.com/garuggiero/Italian-Datasets-for-AV
[3] https://pan.webis.de/clef15/pan15-web/authorship-verification.html
[4] https://www.forumfree.it/
[5] https://www.idiariraccontano.org
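As an illustration, the four bleaching features described above can be sketched as follows. This is a minimal sketch, not the original implementation of van der Goot et al. (2018): the emoji/emoticon handling of PunctA is omitted (such characters fall through to 'P' here), and the log-base-10 rounding used for the Frequency bucket is our assumption.

```python
import math
import re
from collections import Counter

def shape(token):
    # Shape: uppercase -> 'U', lowercase -> 'L', digit -> 'D', anything else -> 'X'
    return "".join(
        "U" if c.isupper() else "L" if c.islower() else "D" if c.isdigit() else "X"
        for c in token
    )

def punct_a(token):
    # PunctA: runs of alphanumeric characters -> 'W', punctuation -> 'P'.
    # Emoji ('J') and emoticon ('E') handling is omitted in this sketch.
    collapsed = re.sub(r"\w+", "W", token)
    return re.sub(r"[^\w]", "P", collapsed)

def length(token):
    # Length: number of characters, zero-padded as in the paper's example ('05')
    return f"{len(token):02d}"

def frequency(token, counts):
    # Frequency: rounded log10 of the token's corpus count
    # (the exact bucketing is an assumption of this sketch)
    return str(round(math.log10(counts[token])) if counts[token] > 0 else 0)

def bleach(token, counts):
    # Concatenate the four features with spaces
    return " ".join([shape(token), punct_a(token), length(token), frequency(token, counts)])

# Under this sketch's bucketing, a corpus where 'House' occurs a million times
# reproduces the paper's example:
counts = Counter({"House": 1_000_000})
assert bleach("House", counts) == "ULLLL W 05 6"
```

The Shape, PunctA and Length components ('ULLLL', 'W', '05') follow directly from the definitions; the final Frequency digit depends on the corpus counts.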
2.2 Reformatting

We reformat both datasets in order to make them suitable for AV. The data is divided into so-called problems: each problem is made of a known and an unknown text of equal length.

To account for the shortness of the texts, and to avoid topic biases that would derive from taking consecutive text as known and unknown fragments, all the documents written by the same author are first shuffled and then concatenated into a single string. The string is split into two spans containing the same number of words, so that the words contained in the unknown span come from subsets of texts which are different from the ones that form the known one. An example of this process is displayed in Figure 1. Rather than being represented by individual productions, each author is therefore represented by a set of texts, whose original sequential order has been altered.

Figure 1: Example of the creation of known and unknown documents for the same author when considering 400 words per author.

Each known text is paired with an unknown text from the same author. To create negative instances, given a dataset with multiple problems, one can (i) make use of external documents (extrinsic approach (Seidman, 2013; Koppel and Winter, 2014)), or (ii) use fragments collated from all authors in the training data, except the target author (intrinsic approach). We create negative instances with an intrinsic approach. More specifically, following Dwyer (2017), the second half of the unknown array is shifted by one, so that the texts of the second half of the known array are paired with a different-author text in the unknown array. In this way, the label distribution is balanced.
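The document construction and pairing just described can be sketched as follows. This is our own minimal illustration, not the released preprocessing code; the function names and the fixed shuffling seed are assumptions of the sketch.

```python
import random

def make_ku_pair(docs, words_per_author, seed=0):
    # Shuffle one author's documents, concatenate them into a single string,
    # and split the first `words_per_author` words into equal known/unknown halves.
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    words = " ".join(docs).split()[:words_per_author]
    half = len(words) // 2
    return " ".join(words[:half]), " ".join(words[half:2 * half])

def make_problems(authors_docs, words_per_author):
    # Start with one positive (same-author) KU pair per author...
    pairs = [make_ku_pair(docs, words_per_author) for docs in authors_docs]
    known = [k for k, _ in pairs]
    unknown = [u for _, u in pairs]
    # ...then shift the second half of the unknown array by one, so those known
    # texts are paired with a different author's text (negative instances).
    # Assumes the second half holds at least two authors.
    mid = len(pairs) // 2
    unknown[mid:] = unknown[mid + 1:] + unknown[mid:mid + 1]
    labels = ["Y"] * mid + ["N"] * (len(pairs) - mid)
    return list(zip(known, unknown, labels))
```

With four authors, the first two problems keep their same-author pairing ('Y'), while the last two receive a rotated, different-author unknown text ('N'), yielding a balanced label distribution.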
3 Method

Given a pair of known and unknown fragments (KU pair), the task is to predict whether they are written by the same author or not. In designing our experiments, we control for topic, gender, amount of evidence, and genre. The latter is fostered by the diverse nature of our datasets.

Topic Maintaining the topic roughly constant should allow stylistic features to gain more discriminative value. We design intra-topic (IT) and cross-topic (CT) experiments. In IT, we distinguish same- and different-topic KU pairs. In same-topic, we train and test the system on KU pairs from the same topic. In different-topic, we include the Mix set and the diaries. Since we train and test on a mixture of topics and there can be topic overlap, these are not truly cross-topic, and we do not consider them as such.

Given that no topic classification is available for the diaries, the CT experiments are only performed on the ForumFree dataset. We train the system on Medicina Estetica and test it on Programmi Tv, and vice versa.

Gender Previous work has shown that similarity can be observed in writings of people of the same gender (Basile et al., 2017; Rangel et al., 2017).[6] In order to assess the influence of same vs different gender in AA, we consider three gender settings: only female authors and only male authors (single-gender), and mixed-gender, where the known and unknown document can be either written by two authors of the same gender, or by a male and a female author. In dividing the subsets according to the gender of the authors, we consider gender implicitly. However, we also perform experiments adding gender as a feature to the instance vectors, indicating both the gender of the known and unknown documents' authors and whether or not the gender of the authors is the same.

Evidence Following Feiguina and Hirst (2007), we experiment with KU pairs of different sizes, i.e. with 400, 1 000, 2 000 and 3 000 words per author. Each element of the KU pair is thus made up of 200, 500, 1 000 and 1 500 words respectively. To observe the effect of the different text sizes on the classification, we manipulate the number of instances in training and test, so that the same authors are included in all the different word settings of a single topic-gender experiment.

Genre We perform cross-genre (CG) experiments by training on ForumFree and testing on Diaries, and vice versa.

Splits and Evaluation We train on 70% and test on 30% of the instances. However, since we are controlling for gender and topic, the number of instances contained in the training and test sets varies in each experiment. We keep the test sets stable across IT, CT and CG experiments, so that we can compare results. Following the PAN evaluation settings (Stamatatos et al., 2015), we use three metrics. c@1 takes into account the number of problems left unanswered, and rewards the system when it classifies a problem as unanswered rather than misclassifying it. Probability scores are converted to binary answers: every score greater than 0.5 becomes a positive answer, every score smaller than 0.5 corresponds to a negative answer, and every score which is exactly 0.5 is considered as an unanswered problem. The AUC measure corresponds to the area under the ROC curve (Fawcett, 2006), and tests the ability of the system to rank scores properly, assigning low values to negative problems and high values to positive ones (Stamatatos et al., 2015). The third measure is the product of c@1 and AUC.

[6] Binary gender is a simplification of a much more nuanced situation in reality. Following previous work, we adopt it for convenience.
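The three evaluation measures can be sketched as follows. c@1 uses the standard PAN formulation c@1 = (nc/n) + (nu/n)(nc/n), where nc, nu and n are the correct, unanswered and total problems; the pairwise-ranking AUC here is a minimal stand-in for a full ROC implementation such as scikit-learn's roc_auc_score.

```python
def to_answer(score):
    # PAN-style conversion: > 0.5 same-author, < 0.5 different-author,
    # exactly 0.5 left unanswered
    if score > 0.5:
        return "Y"
    if score < 0.5:
        return "N"
    return "-"

def c_at_1(answers, gold):
    # c@1 = (nc/n) + (nu/n) * (nc/n): leaving a problem unanswered
    # is rewarded over misclassifying it
    n = len(gold)
    nc = sum(a == g for a, g in zip(answers, gold) if a != "-")
    nu = answers.count("-")
    return nc / n + (nu / n) * (nc / n)

def auc(scores, gold):
    # Area under the ROC curve via the pairwise-ranking definition: the
    # fraction of (positive, negative) pairs ranked correctly, ties = 0.5.
    # Assumes both classes are present in the gold labels.
    pos = [s for s, g in zip(scores, gold) if g == "Y"]
    neg = [s for s, g in zip(scores, gold) if g == "N"]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def final_score(scores, gold):
    # Combined score: the product of c@1 and AUC
    answers = [to_answer(s) for s in scores]
    return c_at_1(answers, gold) * auc(scores, gold)
```

For example, with scores [0.9, 0.4, 0.5, 0.7] and gold labels ["Y", "N", "N", "N"], one problem is unanswered and one is wrong, giving c@1 = 2/4 + (1/4)(2/4) = 0.625, while the single positive outscores every negative, giving AUC = 1.0.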
Model We run all experiments using GLAD (Hürlimann et al., 2015). This is an SVM with RBF kernel, implemented using Python's scikit-learn library (Pedregosa et al., 2011) and NLTK (Bird et al., 2009). GLAD was designed to work with 24 different features, which take into account stylometry, entropy and data compression measures. We compare GLAD to a simple baseline which randomly assigns a label from the set of possible labels (i.e. 'YES' or 'NO') to each test instance.

Our choice fell on GLAD for a variety of reasons. As a general observation, even in later challenges, SVMs have proven to be the most effective for AA tasks (Kestemont et al., 2019). More specifically, in a survey of freely available AA systems, GLAD showed the best performance and especially high adaptability to new datasets (Halvani et al., 2018). Lastly, de Vries (2020) explored fine-tuning a pre-trained model for AV in Dutch, a less-resourced language compared to English. He found that fine-tuning BERTje (a Dutch monolingual BERT model (de Vries et al., 2019)) with PAN 2015 AV data (Stamatatos et al., 2015) failed to outperform a majority baseline (de Vries, 2020). He concluded that Transformer-encoder models might not be suitable for AA tasks, since they will likely overfit if the documents contain no reliable clues of authorship (de Vries, 2020).

4 Results and Discussion

The number of experiments is high due to the interaction of the dimensions we consider. Tables 2 and 3 only include the mixed-gender results of the IT experiments on Mix (which corresponds to the entire ForumFree dataset used for this study) and Diaries, respectively. Results concerning all dimensions considered are anyway discussed in the text. We refer to the combined score. Since the baseline results are different for each setting, we do not include them. However, all models perform consistently above their corresponding baseline.

For the Mix topic, we achieved 0.966 with 96 authors in total and 3 000 words (Table 2). For the diaries, we achieved 0.821 with 46 authors in total and 3 000 words each (Table 3).[7] Although the training and test sets are of different sizes for both datasets, more evidence seems to help the model to solve the problem.

In the IT experiments, the highest score for Medicina Estetica is 0.923, with 41 authors in total and 1 000 words per author, and for Programmi Tv 0.944, with 59 authors and 3 000 words each. In the CT setting, the scores stay basically the same in both directions. In CG, when training on the diaries and testing on Mix, we obtain the same score as when training on Mix with 3 000 words. When training on Mix and testing on Diaries, we achieved 0.737 on the same test set, and 0.748 with 1 000 words per instance.

                          # Problems                          Eval
# W/A    # Auth    Train    Test    C     I    U    c@1      AUC      *
400      127       88       39      33    6    0    0.846    0.947    0.801
1 000    109       76       33      30    3    0    0.909    0.926    0.842
2 000    100       70       30      29    1    0    0.967    0.995    0.962
3 000    96        67       29      28    1    0    0.966    1.000    0.966

Table 2: Training and test set configurations and IT evaluation scores on Mix texts written by female and male authors. C, I and U are Correct, Incorrect, Unanswered problems.

                          # Problems                          Eval
# W/A    # Auth    Train    Test    C     I    U    c@1      AUC      *
400      229       160      69      47    21   1    0.691    0.725    0.500
1 000    180       126      54      43    11   0    0.796    0.891    0.709
2 000    98        68       30      25    5    0    0.833    0.905    0.754
3 000    46        32       14      12    2    0    0.857    0.958    0.821

Table 3: Training and test configurations and IT evaluation scores on diaries made of NE-converted text written by both genders. C, I and U are Correct, Incorrect, Unanswered problems.

Discussion When more variables interact in the same subset, as in the mixed-gender sets of the ForumFree and Diaries datasets, we found that the classifier uses the implicit gender information. Indeed, it achieves slightly better scores in mixed-gender settings than in female- and male-only ones, suggesting that the classifier might be using internal clustering of the data rather than writing style characteristics. This also explains why results are higher in Mix than in separate topics, because the classifier can use topic information.

[7] Using a bleached representation of the texts, the score increased by 0.36
We also observe that by adding gender as an explicit feature in topic- and gender-controlled subsets, GLAD uses this information to improve classification, especially in mixed-gender scenarios.

Although previous research demonstrated that CT and CG experiments are harder than IT ones (Sapkota et al., 2014; Stamatatos et al., 2015), in our case the scores for the three settings are comparable. However, since we only performed CT and CG experiments on mixed-gender subsets, the gender-specific information might have also played a role in this process (see above).

Overall, the experiments show that using a higher number of words per author is preferable. Although 3 000 words seems to be optimal for most settings, in the large number of experiments that we carried out (not all included in this paper) we also observed that lower amounts of words led to comparable results. This aspect will require further investigation.

5 Conclusion

We experimented with AV on Italian forum comments and diary fragments. We compiled two datasets and performed experiments which considered the interaction among topic, gender, length and genre. Even when the texts are short and present more individual variation than traditional texts used in AA, AV is a feasible task, but having more evidence per author improves classification. While making the task more challenging, controlling for gender and topic ensures that the system prioritizes authorship over different data clusters.

Although the datasets used are intended for AV problems, they can be easily adapted to other AA tasks. We believe this to be one of the major contributions of our work, as it can help to advance the up-to-now limited AA research in Italian.

Acknowledgments

The ForumFree dataset was a courtesy of the Italian Institute of Computational Linguistics "Antonio Zampolli" (ILC) of Pisa.[8]

[8] http://www.ilc.cnr.it/

References

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. N-GrAM: New Groningen Author-profiling Model—Notebook for PAN at CLEF 2017. In CEUR Workshop Proceedings, volume 1866.

Janek Bevendorff, Bilal Ghanem, Anastasia Giachanou, Mike Kestemont, Enrique Manjavacas, Martin Potthast, Francisco Rangel, Paolo Rosso, Günther Specht, Efstathios Stamatatos, Benno Stein, Matti Wiegmann, and Eva Zangerle. 2020. Shared Tasks on Authorship Analysis at PAN 2020. In Joemon M. Jose, Emine Yilmaz, João Magalhães, Pablo Castells, Nicola Ferro, Mário J. Silva, and Flávio Martins, editors, Advances in Information Retrieval, pages 508–516, Cham. Springer International Publishing.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT Model. arXiv preprint arXiv:1912.09582.

Wietse de Vries. 2020. Language Models are not just English Anymore: Training and Evaluation of a Dutch BERT-based Language Model Named BERTje. Master Thesis in Information Science, University of Groningen, The Netherlands.

Gareth Terence Bryan Dwyer. 2017. Novel Approaches to Authorship Attribution. Master Thesis in Language and Communication Technologies, Information Science, University of Groningen, The Netherlands.

Tom Fawcett. 2006. An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8):861–874.

Olga Feiguina and Graeme Hirst. 2007. Authorship Attribution for Small Texts: Literary and Forensic Experiments. In Proceedings of the SIGIR'07 Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN 2007).

Oren Halvani, Christian Winter, and Lukas Graner. 2018. Unary and Binary Classification Approaches and their Implications for Authorship Verification. arXiv preprint arXiv:1901.00399.

Matthew Honnibal. 2015. spaCy: Industrial-strength Natural Language Processing (NLP) with Python and Cython.

Manuela Hürlimann, Benno Weck, Esther van den Berg, Simon Suster, and Malvina Nissim. 2015. GLAD: Groningen Lightweight Authorship Detection. In CLEF (Working Notes).

Patrick Juola and Efstathios Stamatatos. 2013. Overview of the Author Identification Task at PAN 2013. CLEF (Working Notes), 1179.

Patrick Juola. 2015. The Rowling Case: A Proposed Standard Analytic Protocol for Authorship Questions. Digital Scholarship in the Humanities, 30(suppl 1):i100–i113.

Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. 2019. Overview of the Cross-domain Authorship Attribution Task at PAN 2019. In CLEF (Working Notes).

Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast, and Benno Stein. 2020. Overview of the Cross-Domain Authorship Verification Task at PAN 2020. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org, September.

Moshe Koppel and Jonathan Schler. 2004. Authorship Verification as a One-class Classification Problem. In Proceedings of the Twenty-first International Conference on Machine Learning, page 62.

Moshe Koppel and Yaron Winter. 2014. Determining if Two Documents are Written by the Same Author. Journal of the Association for Information Science and Technology, 65(1):178–187.

Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, 60(1):9–26.

Aleksandra Maslennikova, Paolo Labruna, Andrea Cimino, and Felice Dell'Orletta. 2019. Quanti anni hai? Age Identification for Italian. In Proceedings of the 6th Italian Conference on Computational Linguistics (CLiC-it), 13-15 November, 2019, Bari, Italy.

Erwan Moreau, Arun Jayapal, Gerard Lynch, and Carl Vogel. 2015. Author Verification: Basic Stacked Generalization Applied to Predictions from a Set of Heterogeneous Learners—Notebook for PAN at CLEF 2015. In Linda Cappellato, Nicola Ferro, Gareth Jones, and Eric San Juan, editors, CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8-11 September, Toulouse, France. CEUR-WS.org.

Abdulfattah Omar, Basheer Ibrahim Elghayesh, and Mohamed Ali Mohamed Kassem. 2019. Authorship Attribution Revisited: The Problem of Flash Fiction, a Morphological-based Linguistic Stylometry Approach. Arab World English Journal (AWEJ), Volume 10.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine Learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. Working Notes Papers of the CLEF, pages 1613–0073.

Upendra Sapkota, Thamar Solorio, Manuel Montes, Steven Bethard, and Paolo Rosso. 2014. Cross-topic Authorship Attribution: Will Out-of-topic Data Help? In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1228–1237.

Shachar Seidman. 2013. Authorship Verification Using the Impostors Method. In CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, pages 23–26. Citeseer.

Efstathios Stamatatos and Moshe Koppel. 2011. Plagiarism and Authorship Analysis: Introduction to the Special Issue. Language Resources and Evaluation, 45(1):1–4.

Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Martin Potthast, Benno Stein, Patrick Juola, Miguel A. Sanchez-Perez, and Alberto Barrón-Cedeño. 2014. Overview of the Author Identification Task at PAN 2014. In CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK, 2014, pages 1–21.

Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio López-López, Martin Potthast, and Benno Stein. 2015. Overview of the Author Identification Task at PAN 2015. CLEF 2015 Evaluation Labs and Workshop, Online Working Notes, Toulouse, France. In CEUR Workshop Proceedings, pages 1–17.

Efstathios Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

Michail Tsikerdekis and Sherali Zeadally. 2014. Multiple Account Identity Deception Detection in Social Media Using Nonverbal Behavior. IEEE Transactions on Information Forensics and Security, 9(8):1311–1321.

Arjuna Tuzzi and Michele A. Cortelazzo. 2018. Drawing Elena Ferrante's Profile: Workshop Proceedings, Padova, 7 September 2017. Padova UP.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching Text: Abstract Features for Cross-lingual Gender Prediction. arXiv preprint arXiv:1805.03122.

Min Yang and Kam-Pui Chow. 2014. Authorship Attribution for Forensic Investigation with Thousands of Authors. In IFIP International Information Security Conference, pages 339–350. Springer.