<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Authorship Verification Task at PAN 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Efstathios Stamatatos</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mike Kestemont</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Kredens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Pezik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annina Heini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janek Bevendorff</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aston University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bauhaus-Universität Weimar</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leipzig University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Antwerp</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of the Aegean</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>The authorship verification task at PAN 2022 follows the experimental setup of similar shared tasks in the recent past. However, it focuses on a different and very challenging scenario: given two texts belonging to different discourse types, the task is to determine whether they are written by the same author. Based on a new corpus in English, we provide pairs of texts using four discourse types: essays, emails, text messages, and business memos. The differences in communicative purpose, intended audience, and level of formality render the cross-discourse-type authorship verification task very hard. We received 7 submissions and evaluated them using the TIRA integrated research architecture, along with two baseline approaches. This paper reviews the submissions and presents a detailed discussion of the evaluation results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Author identification (or authorship attribution) aims to reveal information about the
individual(s) who wrote a text [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. There are several relevant tasks that emulate real-world
conditions, mainly closed-set authorship attribution (where there is a finite list of candidate
authors) and open-set authorship attribution (where there is a set of candidate authors but
this does not necessarily include the true author(s)) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The former scenario suits cases where
only a short list of persons could eventually be the authors of disputed texts while the latter
can be applied to cases where such lists of candidates are not available (or reliable enough). A
special case of open-set attribution is authorship verification where there is only one candidate
author [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Among author identification tasks, authorship verification plays a key role since any
given case can be decomposed into a series of authorship verification instances.
      </p>
      <p>
        In authorship verification, texts of known authorship by one author are presented to a
system, which is then tasked to verify whether another text has also been written by that same
author [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. In its simplest form, only one text of known authorship is given [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In that case,
for a pair of texts (typically one of known authorship and another of unknown authorship), we
are asked to determine whether they are written by the same author.
      </p>
      <p>
        During the last decade, an extensive list of authorship verification methods have been
proposed [
        <xref ref-type="bibr" rid="ref4 ref6 ref8 ref9">4, 6, 8, 9</xref>
        ]. In addition, several previous PAN editions included a relevant shared
task [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10, 11, 12, 13, 14</xref>
        ]. The effectiveness of authorship verification approaches depends on
several factors. Naturally, text length is a crucial factor, and the effectiveness of systems
usually deteriorates when only short or very short texts are given. Another very challenging form of the
task considers cases where texts of known and unknown authorship belong to different domains.
In cross-domain authorship verification, texts of known and unknown authorship may differ
in topic (politics vs. sports), genre (review vs. essay), or even language (English vs. German).
In PAN 2015, both cross-topic and cross-genre authorship verification were considered, and
relatively low accuracy was obtained, especially for a cross-genre dataset
of essays and reviews in Dutch [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In the last two editions of PAN [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], fanfiction texts
(i.e., non-professional fiction published online by fan authors) belonging to different fandoms
(i.e., fanfiction inspired by certain highly popular works) were used. A large training dataset of
more than 350,000 verification instances was compiled for this task, which enabled the application
of powerful deep learning models [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Perhaps surprisingly, the best results obtained were
rather high, suggesting that most fanfiction authors may retain their stylistic choices across
different fandoms, although other factors that may have artificially boosted the results could not
be ruled out.
      </p>
      <p>The current edition of PAN focuses on cross-discourse type authorship verification, where
texts of known and unknown authorship belong to different discourse types. In particular,
these discourse types have significant differences concerning communicative purpose, intended
audience, or level of formality. For example, the discourse types of argumentative essays and
text messages sent to family members have important stylistic differences imposed by the norms
of these discourse types. It is therefore very challenging to distinguish authorial characteristics that
remain intact across discourse types. In addition, discourse type strongly correlates with text
length (e.g., essays are much longer than text messages), so cross-discourse type authorship
verification can also be used to study the effect of text length on the effectiveness of authorship
verification approaches.</p>
      <p>In this paper, we first present the new datasets and the evaluation framework for the
cross-discourse type authorship verification shared task at PAN 2022. Next, we survey the
received submissions and evaluate their effectiveness in detail. Finally, we discuss the main
conclusions and possible directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The PAN Cross-Discourse Type Authorship Verification Corpus 2022</title>
      <p>A novel dataset was created from a subset of the recent Aston 100 Idiolects Corpus in English
(Kredens, Heini and Pezik 2021), including a rich set of discourse types authored by 112
individuals. We used the following discourse types of written language: emails, essays, text messages,
and business memos. All individuals represented in the corpus are of similar age (18–22) and
are native speakers of English. The topic of text samples is not restricted, while the level of
formality can vary within a certain discourse type (e.g., text messages may be addressed to
family members or other acquaintances). Table 1 gives an overview of the data and the parts of
it used for training and testing different aspects of cross-discourse type authorship verification.</p>
      <p>This corpus has been anonymized in that named entities such as mentions of locations,
person names, addresses, etc. were manually replaced with generic placeholder tags. This is
very useful for evaluating authorship verification methods in cross-discourse type scenarios
since the presence of author-specific and topic-specific information is reduced.</p>
      <p>In order to compile the required training and test datasets for the shared task at hand, the
corpus needed further preprocessing. First, we split the available individuals into two equal and
non-overlapping sets, one to be used for the training dataset and the other for the test dataset.
That way, it is ensured that any particularities of the training authors will not
affect the effectiveness on the test dataset. In addition, we took advantage of the demographic
metadata available and ensured a stable gender distribution of individuals in both the training
and test dataset. More specifically, the training and test datasets represent writings by 56 authors
each (10 male, 45 female and 1 of unidentified gender).</p>
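      <p>For illustration, such an author-level split with a stable gender distribution can be obtained with a standard stratified split. The snippet below is a minimal sketch using stand-in author identifiers and metadata labels that match the overall counts reported above; it is not the actual preprocessing code.</p>
      <preformat>
from sklearn.model_selection import train_test_split

# Stand-in author identifiers and gender metadata (20 male, 90 female,
# and 2 of unidentified gender in total, i.e., 10/45/1 per half).
author_ids = [f"author_{i:03d}" for i in range(112)]
genders = ["male"] * 20 + ["female"] * 90 + ["unknown"] * 2

# Split the 112 individuals into two equal, non-overlapping author sets
# while keeping the gender distribution stable in both halves.
train_authors, test_authors = train_test_split(
    author_ids, test_size=0.5, stratify=genders, random_state=0)
      </preformat>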
      <p>The dataset comprises a set of text pairs, and in each pair the two texts belong to two different
discourse types. All six combinations of the four available discourse types are taken into
account. However, the distribution of text pairs over the combinations of discourse types is
not homogeneous, since it depends on the texts available for each discourse type.
For example, the corpus comprises only one business memo and multiple email messages
per individual. Nevertheless, the distribution of verification instances per discourse type
combination is similar in both training and test datasets as can be seen in Table 1. Similarly,
both training and test datasets have a balanced distribution of positive/negative verification
cases. This also holds for each combination of discourse types (e.g., half of the pairs belonging
to the combination essay–email are positive and the other half are negative).</p>
      <p>Since the length of texts belonging to certain discourse types can be limited, we concatenated
multiple texts of the same discourse type to produce longer text samples. In more detail, email
messages were concatenated so that a text sample of at least 2,000 characters was obtained. The
date of email messages was taken into account so that consecutive messages were concatenated.
In the case of text messages, we concatenated messages sent either to friends or to family, so
that text samples of at least 500 characters were obtained. We inserted the special tag &lt;new&gt;
in the concatenated messages to indicate the original message boundaries. The text lengths in
Table 1 for email and text messages refer to text samples created in this manner.</p>
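      <p>The following is a minimal sketch of this concatenation step (the function and parameter names are illustrative; the actual preprocessing code of the corpus may differ):</p>
      <preformat>
def concatenate_messages(messages, min_chars, boundary_tag="&lt;new&gt;"):
    """Join date-ordered messages of one discourse type into samples of at
    least min_chars characters, marking the original message boundaries."""
    samples, current = [], []
    for text in messages:              # messages assumed sorted by date
        current.append(text)
        joined = f" {boundary_tag} ".join(current)
        if len(joined) >= min_chars:   # e.g., 2,000 for emails, 500 for text messages
            samples.append(joined)
            current = []
    return samples
      </preformat>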
    </sec>
    <sec id="sec-3">
      <title>3. Evaluating Cross-Discourse Type Authorship Verification</title>
      <p>In authorship verification, one has to approximate the target function v : (D, t) → {true, false},
where D is a set of texts of known authorship and t is a text of unknown or disputed
authorship. In the current edition of the task, we consider D to be a singleton. Thus, the task is
to approximate the target function v : (t1, t2) → {true, false} for a pair of texts. If v(t1, t2) = true,
then the author of t1 is also the author of t2 (positive instance), and if v(t1, t2) = false, then the
author of t1 is not the same as the author of t2 (negative instance). The main novelty of the
current edition is that t1 and t2 belong to different discourse types.</p>
      <p>
        The evaluation framework is similar to the one used in recent shared tasks at PAN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For
each authorship verification instance (a pair of texts) of the test dataset, participants have to
produce a scalar score in the [0, 1] range indicating the probability that the pair was written
by the same author. It is possible for participants to leave text pairs unanswered by submitting
a score of exactly 0.5.
      </p>
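      <p>Conceptually, each submitted verifier thus implements a function that maps a pair of texts to a score in the [0, 1] range, with the value 0.5 reserved for non-answers. The toy sketch below only illustrates this interface and convention; it is not one of the actual systems.</p>
      <preformat>
def verify(text1: str, text2: str, radius: float = 0.02) -> float:
    """Toy verifier: return a same-author probability in [0, 1];
    a value of exactly 0.5 encodes a deliberately unanswered case."""
    # Stand-in for a real model: Jaccard overlap of the word sets.
    w1, w2 = set(text1.lower().split()), set(text2.lower().split())
    score = len(w1 &amp; w2) / max(len(w1 | w2), 1)
    # Scores too close to 0.5 are mapped to exactly 0.5 (non-answer).
    return score if abs(score - 0.5) >= radius else 0.5
      </preformat>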
      <sec id="sec-3-1">
        <title>3.1. Evaluation Measures</title>
        <p>
          Similar to recent editions of the authorship verification task [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we adopt a diverse set
of effectiveness measures to highlight different aspects of the capabilities of an authorship
verification model. We reused the four measures from the 2020 edition, but also included the
Brier score [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as an additional fifth measure (following discussions with participants and the
audience at the 2020 workshop). In total, the following effectiveness measures were used:
• AUROC: the area under the ROC curve,
• c@1: a variant of the conventional accuracy measure, which rewards systems that leave
difficult problems unanswered [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
• F1: the well-known F1 effectiveness measure (not taking into account non-answers),
• F0.5: a newly proposed F0.5-based measure that emphasizes correctly answered
same-author cases and rewards non-answers [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ],
• Brier: the complement of the Brier loss function [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] focusing on the accuracy of
probabilistic predictions (as implemented in sklearn [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]). This measure rewards verifiers that
make “bold” but correct predictions (i.e., scores close to 0.0 or 1.0) and indirectly penalizes
less confident ones, including non-answers (scores of 0.5). In line with the other measures,
we take its complement so that higher scores correspond to better effectiveness.
• The average of the above measures is used as the final score to rank the submitted systems.
We also report runtime on TIRA to give an indication of relative efficiency. A minimal sketch of how most of these measures can be computed is given below.
        </p>
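        <p>As a rough illustration (not the official evaluator, whose implementation was released to the participants), most of these measures can be computed as follows; the F0.5-based measure is omitted here for brevity:</p>
        <preformat>
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, brier_score_loss

def c_at_1(y_true, scores):
    """c@1 [17]: accuracy variant that rewards unanswered cases (score == 0.5)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    n = len(y_true)
    answered = scores != 0.5
    correct = np.sum((scores[answered] > 0.5).astype(int) == y_true[answered])
    unanswered = n - answered.sum()
    return (correct + unanswered * correct / n) / n

def evaluate(y_true, scores):
    """Compute AUROC, c@1, F1 (on answered cases), and the Brier complement."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    answered = scores != 0.5
    preds = (scores[answered] > 0.5).astype(int)
    return {
        "AUROC": roc_auc_score(y_true, scores),
        "c@1": c_at_1(y_true, scores),
        "F1": f1_score(y_true[answered], preds),
        "Brier": 1.0 - brier_score_loss(y_true, scores),  # complement: higher is better
    }
        </preformat>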
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Baselines</title>
        <p>
          In order to facilitate the comparison of the submitted methods with established approaches from
the literature in the field, we provide two baseline methods that are based on character n-grams
or character sequences. The source code of the following two methods was made available to
the participants at the start of the campaign (together with an official implementation of the
evaluation measures):
• Compression-based model. Given a pair of texts t1 and t2, the cross-entropy of t2 based
on the Prediction by Partial Matching (PPM) model of t1 is computed, and vice versa [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
Then, a logistic regression classifier is trained using the mean and the absolute difference of
the two cross-entropies. In addition, using a small radius, verification scores around 0.5
are set to exactly 0.5.
• Distance-based character n-gram model [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The most frequent character 4-grams
are extracted from the training texts and used to represent each text. Then, given a pair
of texts, the cosine similarity between them is calculated. During training, two threshold
values θ1 and θ2 are optimized to scale the verification scores. All verification scores
lower than θ1 correspond to negative answers, all scores greater than θ2 are scaled to
positive answers, and the remaining scores are set to 0.5, implying that these are hard
instances that are deliberately left unanswered (a simplified sketch is given below).
        </p>
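        <p>To make the second baseline concrete, the snippet below gives a simplified sketch of a distance-based character 4-gram verifier; the released baseline differs in details such as feature weighting, threshold optimization, and the rescaling of scores between the two thresholds:</p>
        <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CngDistSketch:
    """Simplified character 4-gram / cosine-similarity verifier."""

    def __init__(self, n_features=3000, theta1=0.35, theta2=0.65):
        # theta1/theta2 are placeholder values; in the baseline they are
        # optimized on the training data.
        self.vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(4, 4),
                                          max_features=n_features)
        self.theta1, self.theta2 = theta1, theta2

    def fit(self, training_texts):
        self.vectorizer.fit(training_texts)
        return self

    def verify(self, text1, text2):
        vecs = self.vectorizer.transform([text1, text2])
        sim = cosine_similarity(vecs[0], vecs[1])[0, 0]
        if sim > self.theta2:
            return 1.0    # confident positive answer
        if sim >= self.theta1:
            return 0.5    # hard case, left unanswered
        return 0.0        # confident negative answer
        </preformat>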
        <p>The baselines are not tailored to particular discourse types, e.g., by tuning hyperparameters.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Survey of Submissions</title>
      <p>
        We received seven submissions and evaluated their effectiveness and efficiency using the
TIRA integrated research architecture [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. All participants also submitted a notebook paper
describing their approach. The main characteristics of each approach are provided in Table 2.
      </p>
      <p>
        Most participants followed the recent trend in natural language processing and used
pre-trained language models like BERT, T5, or MPNet to obtain text embeddings. Konstantinou
et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] report that several such models were compared and the most effective one was selected.
Approaches not using pre-trained language models exploit graph-based text representations [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
spectral analysis [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], or representations based on traditional feature engineering including
features like frequencies of part-of-speech (POS) tags and word unigrams (najafi22).
      </p>
      <p>
        Regarding the classification model, most participants rely on fully-connected layers that
combine the information from the text representation step. It is also reported that several
traditional machine learning algorithms, such as support vector machines and random forests,
were examined, but their effectiveness was found to be comparatively low [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Other deep
learning methods used are convolutional and Siamese neural networks. Since the use of deep
learning technology usually requires a considerable amount of training data and some extra validation
data, some participants attempted to augment the provided dataset by generating new authorship
verification instances with the help of the available metadata.
      </p>
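      <p>As a generic illustration of this common setup (pre-trained embeddings combined by fully-connected layers), the sketch below shows one way such a pair classifier can be wired up; it does not reproduce any particular participant’s architecture:</p>
      <preformat>
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """MLP over two text embeddings (e.g., pooled BERT/MPNet outputs)."""

    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        # Combine the pair as [e1; e2; |e1 - e2|; e1 * e2] before the MLP.
        self.mlp = nn.Sequential(
            nn.Linear(4 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, e1, e2):
        feats = torch.cat([e1, e2, torch.abs(e1 - e2), e1 * e2], dim=-1)
        return torch.sigmoid(self.mlp(feats)).squeeze(-1)  # same-author probability
      </preformat>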
      <p>Surprisingly, no participant studied discourse-type-specific approaches for the given
combinations despite their substantial differences.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation Results</title>
      <p>This section presents an in-depth analysis of the effectiveness and efficiency of the submitted
approaches: overall, by discourse type, with respect to bias and runtime, and
in comparison to the previous year’s participants.</p>
      <sec id="sec-5-1">
        <title>5.1. Overall results</title>
        <p>Table 3 shows the overall results of all participants. In general, the effectiveness of all submissions
is quite low, reflecting the difficulty of the task. The approaches of najafi22, galicia22,
and jinli22 clearly outperform the rest of the submissions. It is also surprising that a naive
baseline achieved the best overall score, despite the fact that most participant models are quite
sophisticated. On the other hand, the most effective method submitted (najafi22) outperforms
all other submissions and baselines in three out of five evaluation measures. Its main weakness
seems to be the low Brier score, which means that its probabilistic predictions are in need of
improvement (even if its binary class assignments are relatively strong).</p>
        <p>[Table 4: results per discourse-type combination, including panels such as (d) Business memo–Text message and (f) Essay–Business memo; columns: AUROC, F1, F0.5, Brier, Overall.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results by discourse type</title>
        <p>Table 4 breaks down the results by combination of discourse types. Recall
that each discourse type comes with different average text lengths (see Table 1). For instance,
essays are much longer than the rest of the examined discourse types. As Tables 4b, c, and f
show, when essays are part of a pairing, the submission of galicia22 is the most effective
system in terms of overall effectiveness. Where essays are excluded (Tables 4a, e, and d), their
approach is outperformed by that of najafi22. On the shortest discourse types (business
memos and text messages; Table 4d), the submission of jinli22 seems to be the most effective.
This pairing of discourse types also has the lowest overall effectiveness, indicating that text
length (in addition to cross-discourse verification) remains a crucial factor in authorship verification.
The baseline-cngdist22 is relatively stable across combinations of discourse types, while
baseline-compressor22 achieves its optimal results when the longest discourse types (essays
in particular) are involved.</p>
      </sec>
      <sec id="sec-5-2-1">
        <title>5.3. Bias</title>
        <p>[Table: positive, negative, and unanswered cases per submission.]</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Efficiency</title>
        <p>
          Beyond effectiveness, another criterion for evaluating an authorship verification system is
its efficiency, i.e., its runtime cost. Depending on the application, this is a significant
criterion, especially when large volumes of text have to be
analyzed. Table 5b shows the elapsed runtime of each submitted method on TIRA. As can
be seen, the approaches that avoid the use of pre-trained language models [
          <xref ref-type="bibr" rid="ref24 ref25">25, 24</xref>
          ] achieve the
lowest runtime by a large margin. The highest runtime is required by the approach of huang22
that splits texts into segments and examines all combinations of segments.
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>5.5. A Transfer-learning Experiment</title>
        <p>
          We applied the top-performing approaches from the previous 2021 edition of PAN [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] to
the current test dataset. Thanks to software submissions at TIRA, this can be accomplished
with relative ease. This amounts to a transfer-learning experiment, since the three models are
trained and fine-tuned on a cross-fandom authorship verification dataset but now tested on our
cross-discourse type dataset. The following methods have been employed:
• boenninghoff21 [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]: A deep learning-based approach including neural feature
extraction and deep metric learning, deep Bayes factor scoring, uncertainty modeling and
adaptation, a combined loss function, and an additional out-of-distribution detector for
non-responses. In its final step, the model was extended to a majority-voting ensemble.
• embarcaderoruiz21 [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]: Its main idea is similar to that of galicia22. A graph-based
representation approach is combined with a Siamese network.
• weerasinghe21 [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]: A variety of stylometric features, including character and POS
n-grams, function words, and vocabulary richness measures, and a logistic regression
classifier fed with the absolute differences of these features for each text pair.
        </p>
        <p>We made no attempt to modify these methods before applying them to the new cross-discourse
type test dataset.</p>
        <p>
          The effectiveness of the above-mentioned methods on the PAN 2021 test data was exceptional:
all of them obtained an overall score (over the same five evaluation measures used in this paper)
of greater than 0.93 [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. Table 6a shows the effectiveness of the 2021 models on the 2022 test
data. Unsurprisingly, the three models perform much worse. Their overall effectiveness on the
cross-discourse type dataset is very low, much lower than all but one of the seven submissions
and the two baselines shown in Table 3. This suggests that fine-tuning such models to particular
datasets hurts their generalizability. Moreover, cross-fandom verification and cross-discourse
type verification have different characteristics, as reflected in the two available datasets.
        </p>
        <p>Table 6b shows the number of positive and negative answers as well as non-answers for each
of the three 2021 models, which reveals a clear bias of the models towards negative answers. Note
that in the 2021 cross-fandom dataset, all texts have similar length. Likely, this factor, along
with other substantial differences between fanfiction and the discourse types considered in the
cross-discourse type dataset, confuses these models (or at least indicates that they need appropriate
fine-tuning to improve the scaling of the produced verification scores). Note that the AUROC scores
(which do not depend on the scaling of verification scores) are also quite low.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Previous shared tasks on authorship attribution at PAN have played a crucial role in advancing research
in the field of authorship analysis; modern methods have used the PAN datasets extensively for
evaluation purposes and have incrementally improved the state of the art [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ].
Recent editions of PAN focused on fanfiction. The very good results obtained by the
top-performing submissions there may have given the false impression that authorship verification
is an almost solved problem [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. This is in fact not the case, as our experiment shows.
      </p>
      <p>This year, we focused on a very challenging version of the authorship verification task where
text pairs of different discourse types are used. When texts differ in communicative purpose,
intended audience, or level of formality, it is very challenging to identify stable characteristics
associated with authors across these discourse types. The effectiveness of all submissions on the
cross-discourse type dataset was comparatively low, some as low as a random-guess baseline.</p>
      <p>It is also surprising that all submissions, despite their increased level of sophistication in most
of the cases, were outperformed by a naive baseline based on character n-grams and cosine
similarity (at least according to the overall effectiveness across all five evaluation measures).
This suggests that traditional methods based on well-known stylometric features could still
be more effective than deep learning approaches using modern pre-trained language models
for this challenging task. Another factor is the volume of data available for training (roughly
12,000 instances), which can be considered too little for deep learning-based approaches.</p>
      <p>Another crucial issue is text length. It seems that when the relatively long essays were used
as inputs, the graph-based approach of galicia22 was more effective. When shorter texts from
discourse types like emails, business memos, and text messages were used, the pre-trained
language-model-based approaches of najafi22 and jinli22 were more effective.</p>
      <p>The overall low effectiveness achieved shows that there is a lot of room for improvement in
cross-discourse type authorship verification. All submitted approaches adopted a unified model
that predicts authorship disregarding combinations of discourse types. Having separate models
for each combination of discourse types is an obvious next step. This would mean, however, that
the training data would also have to be split into smaller parts based on the combinations of discourse
types. An ensemble method combining traditional stylometric models and pre-trained language
models appears to be a promising approach in this regard.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <article-title>A survey of modern authorship attribution methods</article-title>
          ,
          <source>JASIST</source>
          <volume>60</volume>
          (
          <year>2009</year>
          )
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          . URL: https://doi.org/10.1002/asi.21001. doi:
          <volume>10</volume>
          .1002/asi.21001.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <article-title>Computational methods in authorship attribution</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <year>2009</year>
          )
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <article-title>Authorship attribution in the wild</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          (
          <year>2011</year>
          )
          <fpage>83</fpage>
          -
          <lpage>94</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10579-009-9111-2.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <article-title>Authorship verification: A review of recent advances</article-title>
          ,
          <source>Research in Computing Science</source>
          <volume>123</volume>
          (
          <year>2016</year>
          )
          <fpage>9</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Halvani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Graner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Regev</surname>
          </string-name>
          ,
          <article-title>Taveer: an interpretable topic-agnostic authorship verification method</article-title>
          , in: M.
          <string-name>
            <surname>Volkamer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Wressnegger (Eds.),
          <source>ARES 2020: The 15th International Conference on Availability, Reliability and Security</source>
          , ACM,
          <year>2020</year>
          , pp.
          <volume>41</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          :
          <fpage>10</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Potha</surname>
          </string-name>
          , E. Stamatatos,
          <article-title>Improving author verification based on topic modeling</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>70</volume>
          (
          <year>2019</year>
          )
          <fpage>1074</fpage>
          -
          <lpage>1088</lpage>
          . doi:https://doi.org/10.1002/asi.24183.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <article-title>Determining if two documents are written by the same author</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>65</volume>
          (
          <year>2014</year>
          )
          <fpage>178</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <article-title>Learning stylometric representations for authorship analysis</article-title>
          ,
          <source>IEEE Transactions on Cybernetics</source>
          <volume>49</volume>
          (
          <year>2019</year>
          )
          <fpage>107</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boenninghoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <article-title>Similarity learning for authorship verification in social media</article-title>
          ,
          <source>in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2457</fpage>
          -
          <lpage>2461</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP.
          <year>2019</year>
          .
          <volume>8683405</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Busse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M. R.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Recent trends in digital text forensics and its evaluation - plagiarism detection, author identification, and author profiling</article-title>
          , in: P. Forner,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paredes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          Stein (Eds.),
          <source>Information Access Evaluation</source>
          . Multilinguality, Multimodality, and Visualization - 4th
          <source>International Conference of the CLEF Initiative, CLEF</source>
          <year>2013</year>
          , Valencia, Spain,
          <source>September 23-26</source>
          ,
          <year>2013</year>
          . Proceedings, volume
          <volume>8138</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2013</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M. R.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Improving the reproducibility of pan's shared tasks: - plagiarism detection, author identification, and author profiling</article-title>
          , in: E. Kanoulas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Hall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hanbury</surname>
          </string-name>
          , E. G. Toms (Eds.),
          <source>Information Access Evaluation</source>
          . Multilinguality, Multimodality, and Interaction - 5th
          <source>International Conference of the CLEF Initiative, CLEF</source>
          <year>2014</year>
          ,
          <article-title>Sheffield</article-title>
          , UK,
          <source>September 15-18</source>
          ,
          <year>2014</year>
          . Proceedings, volume
          <volume>8685</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2014</year>
          , pp.
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M. R.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the PAN/CLEF 2015 evaluation lab</article-title>
          , in: J.
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Pinel-Sauvagnat</surname>
            ,
            <given-names>G. J. F.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
          </string-name>
          , E. SanJuan, L. Cappellato, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction - 6th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2015</year>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          , Proceedings, volume
          <volume>9283</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2015</year>
          , pp.
          <fpage>518</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          , I. Markov,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M. R.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Specht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2020:
          <article-title>Authorship verification, celebrity profiling, profiling fake news spreaders on twitter, and style change detection</article-title>
          , in: A.
          <string-name>
            <surname>Arampatzis</surname>
            , E. Kanoulas,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Vrochidis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cappellato</surname>
          </string-name>
          , N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction - 11th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2020</year>
          , Thessaloniki, Greece,
          <source>September 22-25</source>
          ,
          <year>2020</year>
          , Proceedings, volume
          <volume>12260</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2020</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. la Peña</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          , I. Markov,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolska</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2021:
          <article-title>Authorship verification, profiling hate speech spreaders on twitter, and style change detection</article-title>
          , in: K. S. Candan,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction - 12th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2021</year>
          ,
          <string-name>
            <given-names>Virtual</given-names>
            <surname>Event</surname>
          </string-name>
          ,
          <source>September 21-24</source>
          ,
          <year>2021</year>
          , Proceedings, volume
          <volume>12880</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2021</year>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bischoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deckers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schliebs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>The Importance of Suppressing Domain Style in Authorship Analysis</article-title>
          , CoRR abs/
          <year>2005</year>
          .14714 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2005</year>
          .14714.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Brier</surname>
          </string-name>
          , et al.,
          <article-title>Verification of forecasts expressed in terms of probability</article-title>
          ,
          <source>Monthly weather review 78</source>
          (
          <year>1950</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <article-title>A simple measure to assess non-response, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</article-title>
          , HLT '11,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, USA,
          <year>2011</year>
          , p.
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Generalizing unmasking for short texts</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the</source>
          <year>2019</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>654</fpage>
          -
          <lpage>659</lpage>
          . URL: https://doi.org/10.18653/v1/n19-
          <fpage>1068</fpage>
          . doi:
          <volume>10</volume>
          .18653/v1/n19-
          <fpage>1068</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Teahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
          <source>Using Compression-Based Language Models for Text Categorization</source>
          , Springer Netherlands, Dordrecht,
          <year>2003</year>
          , pp.
          <fpage>141</fpage>
          -
          <lpage>165</lpage>
          . URL: https://doi.org/10.1007/978-94-017-0171-6_7. doi:10.1007/978-94-017-0171-6_7.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Karsdorp</surname>
          </string-name>
          , W. Daelemans,
          <article-title>Authenticating the writings of Julius Caesar</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>63</volume>
          (
          <year>2016</year>
          )
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>TIRA integrated research architecture</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          (Eds.),
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          , volume
          <volume>41</volume>
          of
          <source>The Information Retrieval Series</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>160</lpage>
          . URL: https://doi.org/10.1007/978-3-030-22948-1_5. doi:10.1007/978-3-030-22948-1_5.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Konstantinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zinonos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Different Encoding Approaches for Authorship Verification</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Martinez-Galicia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Embarcadero-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ríos-Orduña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <article-title>Graph-Based Siamese Network for Authorship Verification</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Crespo-Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lopez-Arevalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Aldana-Bobadilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Salas-Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cortes-Lopez</surname>
          </string-name>
          ,
          <article-title>A Content Spectral-based Analysis for Authorship Verification</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Z. Han,
          <article-title>Authorship Verification Based on Fully Interacted Text Segments</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          , H. Y.,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Application of BERT in Author Verification Task</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Najafi</surname>
          </string-name>
          , E. Tavan,
          <article-title>Text-to-Text Transformer in Authorship Verification Via Stylistic and Semantical Analysis</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          , H. Y.,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          , Z. Han,
          <article-title>Authorship Verification Using Convolutional Neural Network</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Authorship Verification Task at PAN 2021</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          (Eds.),
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boenninghoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <article-title>O2D2: Out-of-distribution detector to capture undecidable trials in authorship verification</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          (Eds.),
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Embarcadero-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Reyes-Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Embarcadero-Ruiz</surname>
          </string-name>
          ,
          <article-title>Graph-based Siamese network for authorship verification</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          (Eds.),
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Weerasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          ,
          <article-title>Feature vector difference based authorship verification for open-world settings</article-title>
          , in:
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>