=Paper=
{{Paper
|id=Vol-2936/paper-147
|storemode=property
|title=Overview of the Cross-Domain Authorship Verification Task at PAN 2021
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-147.pdf
|volume=Vol-2936
|authors=Mike Kestemont,Enrique Manjavacas,Ilia Markov,Janek Bevendorff,Matti Wiegmann,Efstathios Stamatatos,Benno Stein,Martin Potthast
|dblpUrl=https://dblp.org/rec/conf/clef/KestemontMMBWS021
}}
==Overview of the Cross-Domain Authorship Verification Task at PAN 2021==
Mike Kestemont1, Enrique Manjavacas1, Ilia Markov1, Janek Bevendorff2, Matti Wiegmann2, Efstathios Stamatatos3, Benno Stein2 and Martin Potthast4
1 University of Antwerp, 2 Bauhaus-Universität Weimar, 3 University of the Aegean, 4 Leipzig University
pan@webis.de, https://pan.webis.de

Abstract
Idiosyncrasies in human writing styles make it difficult to develop systems for authorship identification that scale well across individuals. In this year's edition of PAN, the authorship identification track focused on open-set authorship verification, so that systems are applied to unknown documents by previously unseen authors in a new domain. As in the previous year, the sizable materials for this campaign were sampled from English-language fanfiction. The calibration materials handed out to the participants were the same as last year, but a new test set was compiled with authors and fandom domains not present in any of the previous datasets. The general setup of the task did not change, i.e., systems still had to estimate the probability of a pair of documents being authored by the same person. We attracted 13 submissions by 10 international teams, which were compared to three complementary baselines, using five diverse evaluation metrics. Post-hoc analyses show that systems benefitted from the abundant calibration materials and were well-equipped to handle the open-set scenario: both the top-performing approach and the highly competitive cohort of runners-up presented surprisingly strong verifiers. We conclude that, at least within this specific text variety, (large-scale) open-set authorship verification is not necessarily or inherently more difficult than a closed-set setup, which offers encouraging perspectives for the future of the field.

1. Introduction
This paper provides a full-length description of the authorship verification shared task at PAN 2021. This edition was the second task installment in a renewed three-year program on the PAN authorship track (2020-2022), in which the scope, the difficulty, and the realism of the tasks are gradually increased each year. After last year's edition focused on providing participants with by far the largest pool of calibration material of any previous authorship shared task at PAN (a technical challenge in its own right), we sought to increase the difficulty this year by sampling a fully disjoint test set. This differs from last year's edition, where the overall task difficulty was kept in check by resorting to a closed-set evaluation scenario in which the test set was restricted to authors and fandom domains also included in the calibration set (hence a clever participant could re-cast the task as an attribution task). This year's test set, on the other hand, comes with document pairs of exclusively unseen authors writing in unseen fandom domains, which results in an open-set or "true" authorship verification scenario, conventionally considered a much more demanding problem than attribution. For the next year, we have planned a consecutive and final "surprise task", on which more details will be released in due time.
In the following, we first contextualize and motivate the design choices outlined above. Next, we formalize the task, describe the composition of this year's test set, and detail the employed evaluation metrics as well as the three generic baseline systems that were applied as a point of reference. In the sections following that, we briefly discuss the participating systems through a summary of their respective notebooks, and present the results of the task in tandem with a statistical analysis to assess whether pairs of systems in fact produced significantly different outcomes. In our discussion, we present post hoc analyses, including a comparison with last year's results regarding the distribution of scores, the effect of non-answers, and the relationship between stylistic and topical similarities. Finally, we assess the contributions of this year's edition to the closed-set vs. open-set debate and offer an outlook into the future.

1.1. Motivation and Design Rationale
Much of the research in present-day computational authorship identification is implicitly underpinned by a basic assumption that could be summarized as the "Stylome Hypothesis". This hypothesis, seminally formulated by van Halteren et al. [1], states that every writing individual leaves a unique stylistic and linguistic "fingerprint" in their work, i.e., a set of stable empirical characteristics that can be extracted from and identified in a large-enough writing sample. By analogy with the human genome, the assumption is that this fingerprint is a sufficient means of identifying the author of any given writing sample, provided it is long enough. The Stylome Hypothesis is an attractive working hypothesis, but it remains hard to demonstrate, let alone prove. Experimental studies in the past decades have enabled scholars to close in on the experimental conditions that must be met for authorship identification: we know that texts have to be long enough to be analyzed in the first place, and verification across different text varieties has proven to be very challenging, not to mention issues of collaborative authorship or copy editors who inject additional stylistic noise. Cases where a reliable set of candidate authors is already available are easier to solve than those where such a list cannot be established. One general property of human authorship that has emerged in various studies appears to be its ad-hoc nature: even within a single genre, textual features that work well to differentiate author A from a set of peers might fail to separate author B from the same set of peers. The many idiosyncrasies that occur in an individual's writing style make it challenging to develop systems that can be robustly scaled across many different individuals; modeling authorial writing style requires bespoke models that are tailored to the characteristics of a single author or a specific set of authors. These observations tie in with two important scenarios that are commonly distinguished in the field: closed-set and open-set authorship verification. The former term describes the situation in which a system is applied to a set of texts by authors who are already known to the system (as they were seen during the training or calibration phase). The latter term describes the scenario in which a system is applied to texts whose authors are (potentially) unknown.
This open-set scenario is supposedly much more challenging, since one would expect verification systems to overfit on textual properties that are significant for distinguishing a known author from their known peers, but which may eventually turn out not to be a general characteristic of that author's style and hence fail to distinguish them from other, unknown authors. This state of affairs has clearly motivated and shaped the shared task in authorship identification at PAN over the years. In particular, three factors have informed the design of the tasks: (1) issues of scale, (2) methodological developments, and (3) the ad-hoc nature of authorship. First of all, to reliably assess the plausibility of the Stylome Hypothesis, much larger corpora are required than were previously available. It is only in recent years, in fact, that larger datasets for authorship attribution have become more widespread. This concern relating to scale is closely related to methodological developments in the field. In the 2018 task overview paper, the organizers voiced serious concerns about a noticeable lack of diversity in the submitted systems. Apart from a few exceptions, most of the systems at the time took the form of a simple classifier (typically a linear SVM or decision tree) that was applied to a bag-of-words representation of documents on the basis of character n-grams and other conventional feature sets. This methodological dearth was remarkable, since deep (neural) representation learning had already been shaping the landscape of NLP for several years. Such late adoption of deep neural models for authorship identification was very likely an immediate result of a lack of sufficient training resources of the kind typically required for representation learning (in particular for the data-hungry pre-training and fine-tuning of sentence- or document-level embeddings).

2. Authorship Verification
The most central element of authorship analysis is the identification of the document's author(s) [2, 3, 4]. In various fields, scholars have been studying how stylistic and linguistic properties of documents can be harnessed to achieve this goal. Because of the variety in authorial styles, including diachronic and synchronic shifts, progress in the field of style-based document authentication is hard to monitor, as it requires extensive, transparent, and repeated benchmarking initiatives [5]. The long-running authorship identification track at PAN hopes to contribute in this area and has organized tasks on authorship identification in various guises. The following overview of the central concepts serves as an update on a previously published survey [6]:
• Authorship Attribution: Given a document and a set of candidate authors, determine who wrote the document (PAN 2011-2012, 2016-2020);
• Authorship Verification: Given a pair (or collection) of documents, determine whether they are written by the same author (PAN 2013-2015, 2021);
• Authorship Obfuscation: Given a set of documents by the same author, paraphrase one or all of them so that its author cannot be identified anymore (PAN 2016-2018);
• Obfuscation Evaluation: Devise and implement performance measures that quantify safeness, soundness, and / or sensibleness of an obfuscation software (PAN 2016-2018).
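To make the "conventional" feature-based approach mentioned above concrete, the sketch below pairs a character n-gram bag-of-words representation with a linear SVM, casting verification as classification over the absolute difference of the two documents' feature vectors. This is a generic, hypothetical illustration rather than a reconstruction of any PAN submission, and the toy data is a placeholder.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Placeholder calibration pairs; 1 = same author (SA), 0 = different authors (DA).
pairs = [("I walked home slowly.", "I walked to the store."),
         ("It was a dark, cold night.", "The night was dark and cold."),
         ("Thunder rolled over the hills.", "She checked her phone again."),
         ("He never liked mornings.", "The committee convened at noon.")]
labels = [1, 1, 0, 0]

# Character n-grams are the classic stylometric workhorse.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
vectorizer.fit([text for pair in pairs for text in pair])

def pair_features(text_a, text_b):
    """Represent a pair by the absolute difference of its TF-IDF vectors."""
    v_a, v_b = vectorizer.transform([text_a, text_b]).toarray()
    return np.abs(v_a - v_b)

X = np.vstack([pair_features(a, b) for a, b in pairs])

# A linear SVM wrapped in probability calibration, so the verifier can emit
# the scalar score in [0, 1] that the task requires.
verifier = CalibratedClassifierCV(LinearSVC(), cv=2).fit(X, labels)
score = verifier.predict_proba(pair_features(*pairs[0]).reshape(1, -1))[0, 1]
print(f"same-author probability: {score:.3f}")
```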
The formal goal of authorship verification is to approximate the target function φ: (D_k, d_u) → {T, F}, where D_k is a set of documents of known authorship by the same author and d_u is a document of unknown or questioned authorship.1 If φ(D_k, d_u) = T, then the author of D_k is also the author of d_u, and if φ(D_k, d_u) = F, then the author of D_k is not the same as the author of d_u. In the case of cross-domain verification, D_k and d_u stem from different text varieties or encompass considerably different content (e.g., topics or themes, genres, registers, etc.). For the present task, we considered the simplest (and most challenging) formulation of the verification task, i.e., we only considered cases where D_k is a singleton, so that only pairs of two documents are examined. Given a training set of such problems, the verification systems of the participating teams had to be trained and calibrated to analyze the authorship of the unseen text pairs (from the test set). We distinguish between same-author text pairs (SA: φ(D_k, d_u) = T) and different-author text pairs (DA: φ(D_k, d_u) = F). In terms of setup, the novelty this year was that (1) the authors and (2) the stories' fandom domains in the test set were not part of any of the provided calibration materials, which, theoretically speaking, should make this year's task more challenging than last year's.

3. Datasets
Given our aim to benchmark authorship identification systems at a much larger scale, our tasks in recent years [8, 9] focused on transformative literature, or so-called "fanfiction" [10], a text variety that is nowadays abundantly available on the internet [11], with rich metadata and in many languages. Additionally, fanfiction is an excellent source of material for studies of cross-domain scenarios, since users often publish "fics" ranging over multiple topical domains ("fandoms"), such as Harry Potter, Twilight, or Marvel comics. The datasets we provided for our tasks at PAN 2020 and PAN 2021 were crawled from the long-established fanfiction community fanfiction.net. Access to the data can be requested on Zenodo.2 The 2021 edition of the authorship verification task built upon last year's [7] with the same general task layout and training data, but with a conceptually different test set. We retained the overall cross-domain setting, in which the texts in a pair stem from different fandoms, but we replaced the closed-set setting with an open-set setting, where both the authors and the fandoms in the test set are entirely "new" and do not occur in the training set. The training resources were identical to those from last year and came in a "small" and a "large" variant. The large dataset contains 148,000 same-author and 128,000 different-author pairs across 1,600 fandoms; each author has written in at least two, but not more than six, fandoms. The small training set is a subset of the large training set with 28,000 same-author and 25,000 different-author pairs from the same 1,600 fandoms. The new test set was sampled with the same general strategy (19,999 text pairs in total), but in a way that fulfills the previously described open-set constraints, to make the task, at least in theory, more difficult.

Footnotes:
1 This paragraph is based on last year's overview paper [7] and is included for the sake of completeness.
2 https://zenodo.org/record/3716403
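Concretely, each verification problem in this formulation boils down to one (D_k, d_u) pair plus a binary label. The sketch below reads such problems from JSON-lines files; the field names reflect the layout of the PAN 2020/21 releases as we recall it and should be checked against the actual data, so treat them as assumptions.

```python
import json

def load_problems(pairs_path, truth_path=None):
    """Yield (problem_id, d_k, d_u, same_author) tuples.

    Assumed JSONL layout (verify against the actual PAN release):
      pairs.jsonl: {"id": ..., "fandoms": [f1, f2], "pair": [text1, text2]}
      truth.jsonl: {"id": ..., "same": true/false, "authors": [a1, a2]}
    """
    truth = {}
    if truth_path is not None:
        with open(truth_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                truth[record["id"]] = record["same"]
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            d_k, d_u = record["pair"]          # D_k is a singleton in this task
            yield record["id"], d_k, d_u, truth.get(record["id"])

# A system then emits one answer per problem, e.g. a JSONL line such as
# {"id": "...", "value": 0.83}, where value is the same-author score in [0, 1].
```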
4. Evaluation Framework
For each of the 19,999 problems (or document pairs) in the test set, the systems had to produce a scalar score a_i in the range [0, 1], indicating the (scaled) probability that the pair was written by the same author (a_i > 0.5) or by different authors (a_i < 0.5). Systems could choose to leave problems they deemed too difficult to decide unanswered by submitting a score of precisely a_i = 0.5. Such a non-answer is rewarded by some of the metrics relative to a wrong answer.

4.1. Performance Measures
Similar to last year, we adopted a diverse mix of evaluation metrics that focused on different aspects of the verification task at hand. We reused the four evaluation metrics from the 2020 edition, but also included the (complement of the) Brier score [12] as an additional fifth metric (following discussions with participants and audience from the 2020 workshop3). The following performance measures were used:
• AUC: the ROC area-under-the-curve score;
• c@1: a variant of the conventional accuracy measure, which rewards systems that leave difficult problems unanswered [13];
• F1: the well-known F1 performance measure (not taking into account non-answers);
• F0.5u: a newly proposed F0.5-based measure that emphasizes correctly answered same-author cases and rewards non-answers [14];
• Brier: the Brier score (more precisely: the complement of the Brier score loss function [12] as implemented in sklearn [15]), a straightforward, strictly proper scoring rule that measures the accuracy of probabilistic predictions.
The inclusion of the Brier score was meant to measure the probabilistic confidence of the verifiers in a more fine-grained manner. This metric rewards verifiers that produce bolder but correct scores (i.e., a_i close to 0.0 or 1.0). Conversely, the metric indirectly penalizes less committal solutions, such as non-answers (a_i = 0.5). To produce a final ranking for a system, we used the mean score across all individual measures.

4.2. Baselines
In total, we provided three baseline systems (calibrated on the small training set) for comparison, of which the first two were also employed during last year's competition. These were a compression-based approach [16] and a naive distance-based, first-order bag-of-words model [17]. Both were made available to the participants at the start. The third baseline was a post-hoc addition for this overview paper and consisted of a short-text variant of Koppel and Schler's unmasking [18, 19], which had yielded good empirical results in the recent past.

Footnote:
3 Thanks to Fabrizio Sebastiani (Consiglio Nazionale delle Ricerche, Italy) for this suggestion.
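To make the scoring in Section 4.1 concrete, here is a minimal sketch of the five measures and the overall mean used for ranking. It follows our reading of the cited definitions (c@1 from Peñas and Rodrigo [13], F0.5u from Bevendorff et al. [14]); the official PAN evaluator script remains the reference implementation and may differ in details such as tie handling.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, brier_score_loss

def evaluate(y_true, scores):
    """Sketch of the PAN 2021 measures; y_true holds 0/1 labels, scores the
    submitted values in [0, 1], where exactly 0.5 encodes a non-answer."""
    y_true = np.asarray(y_true, dtype=int)
    scores = np.asarray(scores, dtype=float)
    answered = scores != 0.5
    pred = (scores > 0.5).astype(int)

    auc = roc_auc_score(y_true, scores)

    # c@1 rewards leaving difficult problems unanswered.
    n = len(y_true)
    n_correct = int(np.sum(pred[answered] == y_true[answered]))
    n_unanswered = int(np.sum(~answered))
    c_at_1 = (n_correct + n_unanswered * n_correct / n) / n

    # Plain F1 over the answered problems only.
    f1 = f1_score(y_true[answered], pred[answered])

    # F0.5u emphasizes same-author cases; non-answers are counted as
    # false negatives (beta = 0.5).
    tp = np.sum((pred == 1) & (y_true == 1) & answered)
    fp = np.sum((pred == 1) & (y_true == 0) & answered)
    fn = np.sum((pred == 0) & (y_true == 1) & answered)
    f05u = 1.25 * tp / (1.25 * tp + 0.25 * (fn + n_unanswered) + fp)

    # Complement of the Brier loss, as implemented in scikit-learn.
    brier = 1.0 - brier_score_loss(y_true, scores)

    overall = float(np.mean([auc, c_at_1, f1, f05u, brier]))
    return {"AUC": auc, "c@1": c_at_1, "F1": f1, "F0.5u": f05u,
            "Brier": brier, "Overall": overall}
```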
5. Survey of Submissions
The authorship verification task received 13 submissions from 10 participating teams. In this section, we provide a concise overview of the submitted systems. For further details (including bibliographic references), we refer the interested reader to the full versions of the notebooks. Teams were allowed to hand in exactly one submission per training dataset (large and small). Three teams submitted two systems; the other teams either deliberately chose to submit only a single variant or were unable to produce a valid run in time. The systems listed below are described in the order in which the notebooks were initially submitted.
1. ikae21 [20] used a hard majority-voting ensemble that incorporated five different machine-learning classifiers (i.e., linear discriminant analysis, gradient boosting, extra trees, support vector machines, and stochastic gradient descent). The features used were the top-800 TF-IDF-weighted word unigrams.
2. menta21 [21] exploited two types of stylometric features, character n-grams and punctuation marks, and trained a neural network on each type of feature separately. The outputs were concatenated and fed into another neural network in order to obtain the predictions.
3. liaozhihao21 [22] used four retrieval models from the Lucene framework. Each retrieval model assigned a probability to a piece of text that it was written by the corresponding author. A weighted average of these probabilities was then calculated to obtain the final score. The approach assumes that both texts were written by the same author if the highest final score corresponds to the same author.
4. weerasinghe21 [23] extracted stylometric features from each text pair and used the absolute differences between the feature vectors as input to a logistic regression classifier. The features included character and POS n-grams, special characters, function words, vocabulary richness, POS-tag chunks, and unique spellings.
5. boenninghoff21 [24] presented a hybrid neural-probabilistic end-to-end framework, which included neural feature extraction and deep metric learning, deep Bayes factor scoring, uncertainty modeling and adaptation, a combined loss function, and additionally an out-of-distribution detector for defining non-responses. In the final step, the model was extended to a majority-voting ensemble.
6. peng21 [25] proposed an approach that split the texts into fragments and used BERT to extract feature vectors from each fragment, which were then concatenated and fed into a neural network for the final predictions.
7. futrzynski21 [26] proposed an approach based on the cosine similarities of output representations extracted from BERT. These similarities were compared to several thresholds and were rescaled in order to classify a text pair. The BERT model was trained on the following tasks: masked language modeling, author classification, fandom classification, and author-fandom separation. In addition, the authors proposed a method for decreasing the computational costs by combining embeddings of many short text sequences.
8. embarcaderoruiz21 [27] proposed a novel approach in which a graph representation of the texts served as input to a Siamese network. The feature extraction network consisted of node embedding layers to obtain vector representations for each node in the graph, as well as a global pooling. The authors also incorporated stylometric features, combining them with the graph components into an ensemble.
9. tyo21 [28] used BERT within a Siamese network. The embedding space was optimized so that texts written by the same author are adjacent in that space, while texts written by different authors are farther apart. At inference time, the distance between embeddings was compared to a threshold (selected based on a grid search) to make the predictions.
10. rabinovits21 [29] relied on regression models. The authors used the cosine distance for a set of vector-based features (word and character frequencies, POS tags, POS chunk n-grams, punctuation, stopwords) and absolute differences for scalar features (vocabulary richness, average sentence length, Flesch reading ease score) as measures of text-pair similarity. The concatenated similarity scores were used as input to a random forest model (adapted as a regressor).
Overall, we observe a healthy diversity of methods, with several novel approaches, for instance from representation learning with neural networks, appearing among more established methods from text classification or information retrieval. Multiple teams employed a so-called "Siamese" neural network approach [30], which seems to be a natural choice for the analysis of text pairs.

6. Evaluation Results
Table 1 offers a tabular representation of the final results of the submitted systems on the PAN 2021 test set. The overall ranking is based on the mean performance across the five evaluation metrics (last column). The dataset column indicates whether a system was calibrated on the "large" or "small" dataset. In the following, we refer to these as "large" and "small" systems or submissions.

Table 1: System rankings for all PAN 2021 submissions across five evaluation metrics (AUC, c@1, F1, F0.5u, Brier) and an overall mean score (as the final ranking criterion). The dataset column indicates which calibration dataset was used. Bold digits reflect the per-column maximum; horizontal lines indicate the range of scores yielded by the baselines (in italics).
Team | Dataset | AUC | c@1 | F1 | F0.5u | Brier | Overall
boenninghoff21 | large | 0.9869 | 0.9502 | 0.9524 | 0.9378 | 0.9452 | 0.9545
embarcaderoruiz21 | large | 0.9697 | 0.9306 | 0.9342 | 0.9147 | 0.9305 | 0.9359
weerasinghe21 | large | 0.9719 | 0.9172 | 0.9159 | 0.9245 | 0.9340 | 0.9327
weerasinghe21 | small | 0.9666 | 0.9103 | 0.9071 | 0.9270 | 0.9290 | 0.9280
menta21 | large | 0.9635 | 0.9024 | 0.8990 | 0.9186 | 0.9155 | 0.9198
peng21 | small | 0.9172 | 0.9172 | 0.9167 | 0.9200 | 0.9172 | 0.9177
embarcaderoruiz21 | small | 0.9470 | 0.8982 | 0.9040 | 0.8785 | 0.9072 | 0.9070
menta21 | small | 0.9385 | 0.8662 | 0.8620 | 0.8787 | 0.8762 | 0.8843
rabinovits21 | small | 0.8129 | 0.8129 | 0.8094 | 0.8186 | 0.8129 | 0.8133
ikae21 | small | 0.9041 | 0.7586 | 0.8145 | 0.7233 | 0.8247 | 0.8050
unmasking21 | small | 0.8298 | 0.7707 | 0.7803 | 0.7466 | 0.7904 | 0.7836
tyo21 | large | 0.8275 | 0.7594 | 0.7911 | 0.7257 | 0.8123 | 0.7832
naive21 | small | 0.7956 | 0.7320 | 0.7856 | 0.6998 | 0.7867 | 0.7600
compressor21 | small | 0.7896 | 0.7282 | 0.7609 | 0.7027 | 0.8094 | 0.7581
futrzynski21 | large | 0.7982 | 0.6632 | 0.8324 | 0.6682 | 0.7957 | 0.7516
liaozhihao21 | small | 0.4962 | 0.4962 | 0.0067 | 0.0161 | 0.4962 | 0.3023

Table 2: Significance of pairwise differences in F1 scores between submissions. Notation: '=' (not significant: p ≥ 0.05), '*' (significant with p < 0.05), '**' (significant with p < 0.01), '***' (significant with p < 0.001).
Columns: embarcaderoruiz21-small, embarcaderoruiz21-large, weerasinghe21-small, weerasinghe21-large, compressor21-small, unmasking21-small, liaozhihao21-small, rabinovits21-small, futrzynski21-large, menta21-small, menta21-large, naive21-small, peng21-small, ikae21-small, tyo21-large
boenninghoff21-large: *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
embarcaderoruiz21-large: * = *** ** *** *** *** *** *** *** *** *** *** ***
weerasinghe21-large: *** *** = *** *** *** *** *** *** *** *** *** ***
weerasinghe21-small: *** *** *** *** *** *** *** *** *** *** *** ***
menta21-large: *** *** *** *** *** *** *** *** *** *** ***
peng21-small: *** *** *** *** *** *** *** *** *** ***
embarcaderoruiz21-small: *** *** *** *** *** *** *** *** ***
menta21-small: *** *** *** *** *** *** *** ***
rabinovits21-small: *** *** *** *** *** *** ***
ikae21-small: *** * *** *** *** ***
unmasking21-small: *** *** *** *** ***
tyo21-large: *** *** *** ***
naive21-small: = *** ***
compressor21-small: *** ***
futrzynski21-large: ***

In Table 2, we show a pairwise comparison of all combinations of systems to assess whether their solutions are significantly different from each other (based on their F1 scores). The statistical procedure we applied for this is the approximate randomization test [31], with 10,000 bootstrap iterations per comparison. The top-performing system this year was contributed by the participant who submitted last year's strongest system. Team boenninghoff21 achieved an exceptionally solid and robust performance, including the overall highest score across all evaluation metrics. The team in first place is followed by a tight cohort of strong runners-up (embarcaderoruiz21, weerasinghe21, menta21, and peng21), who all achieved similar scores in the same ballpark. With the exception of three systems (tyo21, futrzynski21, liaozhihao21), the submitted approaches significantly outperformed the three (unoptimized) baselines. The baselines themselves all yielded surprisingly similar performances, with unmasking21 being the best-performing baseline with a slight edge. Somewhat surprisingly, the system by tyo21 turned out not to be significantly different from the unmasking baseline, although it was based on a completely different verification approach. For most systems, the pairwise F1 scores differ significantly (Table 2), though in the upper echelons we see a few exceptions. This is to be expected with such exceptional (and hence necessarily similar) performances. The top-performing approach did, in fact, produce a significantly different solution from the runner-up, though the same is not true for all systems in the next cohort, which indicates that their particular ranking order does not necessarily indicate their quality, but incorporates a certain amount of chance. Some participants did well on some scoring metrics, but showed a more pronounced drop on others. The system by ikae21, for instance, achieved a more than respectable AUC in the lower nineties, but an F0.5u only in the lower seventies (which should primarily be attributed to the different treatment of same-author pairs by this metric). Overall, the non-responses played an important part in the rankings, primarily affecting the c@1 and F0.5u scores. Systems such as liaozhihao21-small, which delivered binary answers without any non-responses, were at a clear disadvantage in this regard.
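For reference, the following is a minimal sketch of such an approximate randomization test on paired system outputs. It operates on binarized answers and ignores non-responses, which is a simplification; the analysis reported above may differ in such details.

```python
import numpy as np
from sklearn.metrics import f1_score

def randomization_test_f1(y_true, pred_a, pred_b, trials=10_000, seed=0):
    """Approximate randomization test for the difference in F1 between two
    systems evaluated on the same problems (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    observed = abs(f1_score(y_true, pred_a) - f1_score(y_true, pred_b))
    hits = 0
    for _ in range(trials):
        # Randomly exchange the two systems' predictions per problem; under
        # the null hypothesis of no difference this leaves the statistic's
        # distribution unchanged.
        swap = rng.random(len(y_true)) < 0.5
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        if abs(f1_score(y_true, a) - f1_score(y_true, b)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # smoothed p-value
```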
Of particular importance is the observation that if teams submitted separate systems for the large and the small dataset, these invariably yielded significantly different solutions. Most importantly, the "large" variant always outperformed the "small" one. It should be emphasized that last year, the stronger performance of the large systems could have been attributed to the closed-set scenario, in which a sufficiently complex model could have fully memorized each author's individual characteristics. This effect cannot serve as an explanation in this year's edition, because none of the test set authors or fandoms were present in the calibration materials. The performance improvements this year must therefore be attributed to the mere scope or size of the dataset, or to other characteristics not pertaining directly to the individual authors. This serves as additional evidence that systems were generally able to benefit from the increased training dataset size and could capitalize on accessing more abundant and more diverse material by more authors, even in an open-set verification scenario. It also signals clearly that the supposed ad-hoc nature of authorship identification should not be overestimated. At least within a single textual domain, the results demonstrate the feasibility of modeling authorship quite reliably and at a large scale.

7. Discussion
In this section, we provide a more in-depth analysis of the submitted approaches and their evaluation results, also in comparison with last year's task. First, we take a look at the distribution of the submitted verification scores, including a meta classifier. We go on to inspect the effect of non-responses, and finally try to analyze how topic similarities between the texts in a pair might have affected the results.

7.1. Comparison 2020-2021
Due to the intricate similarities and differences between the 2020 and 2021 editions of the task, a more detailed comparison is worthwhile. A clear advantage of the software submission procedure through tira.io is that we were able to rerun the systems from one year on the test dataset of another year in most cases. This way, we were able to perform a cross-evaluation of quite a few systems, with some exceptions due to unresolvable failures when running systems on datasets they were not designed for. These failures were mostly a result of hard-coded assumptions that were violated by the new data; for example, several 2020 systems assumed all fandoms in the test set to be known, which was in clear violation of the 2021 dataset design. In Table 3, we present the performance of these system and data combinations in terms of c@1. This comparison is necessarily incomplete but allows us to glean some interesting trends. Across systems, the scores for the 2020 dataset are consistently lower than for 2021 in all instances but one (ikae).

Table 3: Cross-comparison of the performances (in terms of c@1) across different combinations of submissions (2020 vs. 2021) and test datasets (also 2020 vs. 2021). Some combinations could not be evaluated due to failures when running the system on a dataset it was not designed for.
Team | 2020 system on 2020 data | 2020 system on 2021 data | 2021 system on 2020 data | 2021 system on 2021 data
niven | 0.786 | – | – | –
araujo | 0.770 | 0.81 | – | –
boenninghoff | 0.928 | – | 0.917 | 0.950
weerasinghe | 0.880 | 0.913 | 0.885 | 0.917
ordonez | 0.640 | – | – | –
faber | 0.331 | – | – | –
ikae | 0.544 | 0.503 | 0.742 | 0.758
kipnis | 0.801 | 0.815 | – | –
gagala | 0.786 | 0.804 | – | –
halvani | 0.796 | 0.822 | – | –
embarcaderoruiz | – | – | 0.914 | 0.930
menta | – | – | 0.878 | 0.902
peng | – | – | – | 0.917
rabinovits | – | – | 0.795 | 0.812
tyo | – | – | – | 0.759
futrzynski | – | – | 0.662 | 0.663
liaozhihao | – | – | – | 0.496

We must therefore draw the counter-intuitive conclusion that the open-set formulation with unseen authors and topical domains was, in fact, easier to solve than the closed setting. On the other hand, the new 2021 systems tended to underperform on the 2020 dataset in comparison with the original 2020 submission by the same team, at least in the rare cases in which we were able to make this comparison (i.e., boenninghoff, weerasinghe, ikae).

7.1.1. Distributions
Figure 1 (left) visualizes the overall distribution of the submitted answers for the systems that outperformed the baselines (best-performing system per team). We see a clear trimodal distribution with peaks around 0, 0.5, and 1, respectively. We noticed that systems submitted "bolder" answers than last year, i.e., only a few answers lie in between the three peaks. The middle peak around 0.5 leads to the assumption that some systems deliberately optimized for non-responses. This assumption is further supported by Figure 1 (right), which shows the same observation, but broken down by individual systems. In Figure 2, we plot the precision-recall curves for the above-mentioned submissions, including that of a naive meta classifier that predicts the mean score over all systems (dotted line). Whereas in previous years the meta classifier often suffered from a lack of methodological diversity in the participant systems, this year the mean verification score outperforms most individual systems. Nevertheless, while the meta classifier can compete with boenninghoff21 in terms of precision, it clearly falls short with regard to recall.4

Figure 1: Left: Kernel density estimate across all answer scores submitted, limited to the highest-ranking system per team which outperformed the baselines. Right: Same as the left plot, but broken down by system.
Figure 2: Precision-recall curves for individual systems, as well as a meta classifier that is based on the mean verification score across systems, limited to the highest-ranking system per team which outperformed the baselines.

Footnote:
4 Meta classifier performance: AUC: 0.917, c@1: 0.917, F1: 0.916, F0.5u: 0.919, Brier: 0.917, Overall: 0.917.
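The meta classifier used for Figure 2 is deliberately simple: it averages the per-problem scores of the individual systems. A minimal sketch (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def mean_meta_scores(score_matrix):
    """score_matrix: array of shape (n_systems, n_problems) holding each
    system's verification scores for the same, identically ordered problems.
    Returns the per-problem mean, i.e. the naive meta classifier's score."""
    return np.asarray(score_matrix, dtype=float).mean(axis=0)

# Illustrative usage with hypothetical inputs:
# meta = mean_meta_scores([scores_system_a, scores_system_b, scores_system_c])
# precision, recall, _ = precision_recall_curve(y_true, meta)
```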
7.1.2. Non-answers
Non-answers were an integral aspect of the evaluation procedure. In the submitted scores, but also in the participants' notebooks, we observed that particularly returning participants, such as boenninghoff21 and ikae21, took greater care to fine-tune this aspect of their systems (and were indeed successful in doing so). The different systems used non-responses to varying degrees. In Figure 3, we plot the c@1 performance as a function of the absolute number of non-responses per system. We see that futrzynski21 returned the most non-responses overall, though at the cost of a below-baseline performance. The three baselines, too, gave non-answers in comparably many cases, but were convincingly outperformed by most participant systems. The top-performing systems (boenninghoff21, embarcaderoruiz21) refused to answer cases to a more moderate degree, resulting in an overall very good performance. Many of the other high-ranking systems, such as weerasinghe21, menta21, or peng21, appeared not to pay particular attention to optimizing for this aspect of the task and submitted only very few non-answers, if any.

Figure 3: The c@1 scores per system as a function of the absolute number of non-answers.

We performed a paired, non-parametric Wilcoxon signed-rank test (n = 16) to assess whether the number of non-responses of a system (including the baselines) correlated positively with its c@1 score. The result (W = 28.0; p = 0.019) offers some ground to accept this positive correlation and thus supports the hypothesis that it generally paid off for systems to submit non-answers for difficult cases. Like last year, these observations raise the question to what extent boenninghoff21's competitive edge can be attributed to the system's ability to correctly identify such difficult cases in order to leave them unanswered. Table 4 summarizes the performances of the top systems (one per participant) on all cases for which boenninghoff21 submitted a score of a_i ≠ 0.5. Interestingly, the differences in performance stay the same, as does the ranking, which indicates that the treatment of difficult cases is not the only magic ingredient (we should emphasize boenninghoff21's exceptional F0.5u score on this subset, indicating that they primarily backed off for different-author document pairs).

Table 4: Evaluation results for top-performing systems (one per team), excluding any test problems for which boenninghoff21-large submitted a non-response (a_i = 0.5).
Team | Dataset | AUC | c@1 | F1 | F0.5u | Brier | Overall
boenninghoff21 | large | 0.991 | 0.957 | 0.952 | 0.976 | 0.963 | 0.968
embarcaderoruiz21 | large | 0.977 | 0.946 | 0.947 | 0.927 | 0.942 | 0.948
weerasinghe21 | large | 0.979 | 0.934 | 0.930 | 0.932 | 0.944 | 0.944
menta21 | large | 0.971 | 0.920 | 0.913 | 0.926 | 0.930 | 0.932
peng21 | small | 0.929 | 0.929 | 0.925 | 0.925 | 0.929 | 0.928
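The pay-off from non-answers can also be explored directly. The sketch below is a hypothetical post-processing step, not taken from any submission: it maps raw scores close to 0.5 into an explicit non-answer and measures the effect on c@1, e.g. on held-out calibration data.

```python
import numpy as np

def abstain(scores, radius=0.05):
    """Turn scores within `radius` of 0.5 into exactly 0.5 (a non-answer)."""
    scores = np.asarray(scores, dtype=float)
    out = scores.copy()
    out[np.abs(out - 0.5) < radius] = 0.5
    return out

def c_at_1(y_true, scores):
    """c@1 as in Section 4.1: unanswered problems are rewarded in proportion
    to the accuracy achieved on the answered ones."""
    y_true = np.asarray(y_true, dtype=int)
    scores = np.asarray(scores, dtype=float)
    answered = scores != 0.5
    pred = (scores > 0.5).astype(int)
    n = len(y_true)
    n_correct = int(np.sum(pred[answered] == y_true[answered]))
    return (n_correct + (n - answered.sum()) * n_correct / n) / n

# Hypothetical sweep on calibration scores to choose an abstention band:
# for radius in (0.0, 0.05, 0.1, 0.2):
#     print(radius, c_at_1(y_dev, abstain(dev_scores, radius)))
```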
7.1.3. The Influence of Topic (continued)
In last year's overview paper, we applied a generic topic model to analyze the test problems from a semantic perspective. To avoid repetition, we will not reintroduce this model (non-negative matrix factorization with 150 dimensions, applied to a TF-IDF-normalized bag-of-words representation of content words) at length, but it remains an interesting challenge to analyze this year's test data from the same topical perspective. We applied the same pipeline to this year's test data to assess topic similarities between the document pairs, calculating the cosine similarity between the L1-normalized topic vectors of the two documents in each pair. Overall, the topical distances over all the document pairs in both the 2020 and 2021 test sets show a very similar distribution (2020: μ = 0.656, σ = 0.147; 2021: μ = 0.641, σ = 0.153). This is reassuring, as it shows that while both datasets are cross-fandom, the open-set vs. closed-set reformulation did not introduce any obvious topical artifacts. Generally speaking, all of the trends reported last year also hold on this year's test set:
1. Same-author pairs displayed a higher topical similarity than different-author pairs, indicating that authors do have an inclination to write about the same topics (see Figure 4 (left)). A non-parametric (one-sided, but unpaired) Mann-Whitney U test (n1 = 10,000, n2 = 9,999) lends support to this view (U > 68,687K, p < 0.001).
2. There is a mild but real correlation between the topical similarity of a document pair in a test problem and the average verification score submitted by the systems.
3. The standard linear regression model reported last year yielded β = 0.16 and R² = 0.15. When limited to the correctly answered cases of the meta classifier, the resulting model this year is comparable (β = 0.16, R² = 0.15), but for the incorrect predictions, the coefficients markedly drop (β = 0.09, R² = 0.01).

Figure 4: Left: Distribution of topical similarity, separately for same-author and different-author pairs. Right: The distribution of topical similarity within document pairs in the test set for same-author and different-author pairs, broken down by whether the meta classifier answered the pairs correctly.

All in all, we can hypothesize that this year again, the models were generally susceptible to a misleading influence of topic similarity, as indicated by Figure 4 (right): correctly solved different-author pairs tended to be of lower topical similarity than those answered incorrectly. As last year, this relationship was reversed for the same-author pairs. Thus, topical information can be very useful for authorship verification, but it cannot necessarily be taken at face value.
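A sketch of the topic-similarity pipeline described in this subsection (TF-IDF bag of content words, NMF with 150 components, cosine similarity of L1-normalized topic vectors). Details the text does not spell out, such as the stop word list or the vocabulary cut-off, are filled in with assumptions here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

def topic_similarities(texts_a, texts_b, n_topics=150):
    """Cosine similarity between the L1-normalized NMF topic vectors of the
    two documents in each pair; texts_a[i] and texts_b[i] form pair i."""
    # Approximating the restriction to content words by removing English
    # stop words (an assumption, as is the vocabulary cut-off).
    vectorizer = TfidfVectorizer(stop_words="english", max_features=30000)
    X = vectorizer.fit_transform(list(texts_a) + list(texts_b))
    topics = NMF(n_components=n_topics, init="nndsvd", max_iter=300).fit_transform(X)
    topics = normalize(topics, norm="l1")
    a, b = topics[: len(texts_a)], topics[len(texts_a):]
    # Row-wise cosine similarity of the paired topic vectors.
    numerator = np.sum(a * b, axis=1)
    denominator = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return numerator / np.maximum(denominator, 1e-12)
```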
8. Outlook
Last year's edition proved to be a turning point in the history of the authorship identification track at PAN: through the release of large-scale calibration materials, the performance of authorship models could, for the first time, be benchmarked on a scale sufficient for deep representation learning. This stimulated the adoption of new neural models, which produced competitive and, in some cases, outstanding results. Interestingly, this size increase did attract new participants, while at the same time some of the regular participants from previous years found the sudden increase in data size rather intimidating and struggled to adapt their pre-existing systems to the new data. To counteract this effect and to maximize the inclusivity of our initiative, the separate submission of systems trained on the small and on the large dataset variant was introduced.

Another critical change compared to previous installments was the fact that the new dataset was limited to English-language documents only, a mere result of the availability of the source material. While we assume that most systems would also generalize to other (at least European) languages, we are aware that this might be a potential source of bias, and it remains to be seen to what extent exactly the results reported here will be reproducible in other (more heavily inflected) languages. Also, the effect of the (potentially very many) non-native speakers of English who appear as authors in the data is hard to quantify at this time. To the best of our knowledge, very few studies have looked at authorship identification across different writing languages. One might hypothesize that authors, when active in their native language, will demonstrate greater mastery and diversity of style, while in a second language, less refined writing and typical errors might increase their identifiability. Another deserving field for future studies is the comparison of fanfiction material that was exclusively written by authors who self-identify as (non-professional) "fans" (and hence received very little, if any, moderation or editing) with writing samples by professional authors.

In spite of these critical remarks, the central take-away message from this year's shared task remains positive: modern, large-scale authorship verification systems can perform extremely well within the fanfiction domain. Contrary to our expectations, recasting last year's task as an open-set setup did not degrade, but in fact improved, their performance. Most systems were more than capable of accurately answering the cases, even though none of the authors and fandoms were seen in the training data. This is highly encouraging, though it remains to be seen whether this holds true for other textual domains outside of transformative fiction. In light of the outstanding results, we should certainly raise the uncomfortable question of whether cross-domain authorship verification in the fanfiction domain is simply too easy. Perhaps the variance between different fandoms is limited (e.g., due to a focus on erotic and pornographic content [32]) and should thus not be taken as a proxy for domain differences in other text varieties. Nevertheless, the findings demonstrate that the issue of the ad-hoc nature of authorship identification can be overcome, at least within a single textual domain, which is certainly a positive and encouraging message.

Acknowledgements
As in previous years, this initiative would not have been possible without the generous contributions of the participating teams, whose patience and enthusiasm we wish to acknowledge in what has been an unusually trying edition. Our thanks also go to the CLEF organizers for the continuation of their hard annual work. Finally, we would like to extend our appreciation to Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, and Ben Thies for assembling the fanfiction.net corpus.

References
[1] H. van Halteren, H. Baayen, F. Tweedie, M. Haverkort, A. Neijt, New machine learning methods demonstrate the existence of a human stylome, Journal of Quantitative Linguistics 12 (2005) 65–77. doi:10.1080/09296170500055350.
[2] E. Stamatatos, A survey of modern authorship attribution methods, JASIST 60 (2009) 538–556.
URL: https://doi.org/10.1002/asi.21001. doi:10.1002/asi.21001. [3] P. Juola, Authorship attribution, Foundations and Trends in Information Retrieval 1 (2006) 233–334. [4] M. Koppel, J. Schler, S. Argamon, Computational methods in authorship attribution, Journal of the American Society for Information Science and Technology 60 (2009) 9–26. [5] M. Potthast, S. Braun, T. Buz, F. Duffhauss, F. Friedrich, J. M. Gülzow, J. Köhler, W. Lötzsch, F. Müller, M. E. Müller, R. Paßmann, B. Reinke, L. Rettenmeier, T. Rometsch, T. Sommer, M. Träger, S. Wilhelm, B. Stein, E. Stamatatos, M. Hagen, Who wrote the web? revisiting influential author identification research applicable to information retrieval, in: N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, G. Silvello (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2016, pp. 393–407. [6] J. Bevendorff, B. Ghanem, A. Giachanou, M. Kestemont, E. Manjavacas, M. Potthast, F. Rangel, P. Rosso, G. Specht, E. Stamatatos, B. Stein, M. Wiegmann, E. Zangerle, Shared tasks on authorship analysis at PAN 2020, in: J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, F. Martins (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2020, pp. 508–516. [7] M. Kestemont, E. Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the cross-domain authorship verification task at PAN 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_264.pdf. [8] M. Kestemont, M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, M. Potthast, Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection, in: Working Notes Papers of the CLEF 2018 Evaluation Labs. Avignon, France, September 10-14, 2018/Cappellato, Linda [edit.]; et al., 2018, pp. 1–25. [9] M. Kestemont, E. Stamatatos, E. Manjavacas, W. Daelemans, M. Potthast, B. Stein, Overview of the Cross-domain Authorship Attribution Task at PAN 2019, in: L. Cappellato, N. Ferro, D. Losada, H. Müller (Eds.), CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2380/. [10] K. Hellekson, K. Busse (Eds.), The Fan Fiction Studies Reader, University of Iowa Press, 2014. [11] J. Fathallah, Fanfiction and the Author. How FanFic Changes Popular Cultural Texts, Amsterdam University Press, 2017. [12] G. W. Brier, et al., Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3. [13] A. Peñas, A. Rodrigo, A simple measure to assess non-response, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, Association for Computational Linguistics, USA, 2011, p. 1415–1424. [14] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 654–659. 
URL: https://doi.org/10.18653/v1/n19-1068. doi:10.18653/v1/n19-1068. [15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. [16] O. Halvani, L. Graner, Cross-domain authorship attribution based on compression: Notebook for PAN at CLEF 2018, in: L. Cappellato, N. Ferro, J. Nie, L. Soulier (Eds.), Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, volume 2125 of CEUR Workshop Proceedings, CEUR-WS.org, 2018. URL: http://ceur-ws.org/Vol-2125/paper_90.pdf. [17] M. Kestemont, J. A. Stover, M. Koppel, F. Karsdorp, W. Daelemans, Authenticating the writings of julius caesar, Expert Systems with Applications 63 (2016) 86–96. URL: https://doi.org/10.1016/j.eswa.2016.06.029. doi:10.1016/j.eswa.2016.06.029. [18] M. Koppel, J. Schler, Authorship verification as a one-class classification problem, in: C. E. Brodley (Ed.), Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series, ACM, 2004. doi:10.1145/1015330.1015448. [19] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing Unmasking for Short Texts, in: J. Burstein, C. Doran, T. Solorio (Eds.), 14th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), Association for Computational Linguistics, 2019, pp. 654–659. URL: https://www.aclweb.org/anthology/N19-1068. [20] C. Ikae, UniNE at PAN-CLEF 2021: Author verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [21] A. Menta, A. Garcia-Serrano, Authorship verification with neural networks via stylometric feature concatenation, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [22] Z. Liao, Z. Hong, Z. Li, G. Liang, Z. Mo, Z. Li, Authorship verification of language models based on Lucene architecture, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [23] J. Weerasinghe, R. Singh, R. Greenstadt, Feature vector difference based authorship verification for open world settings, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [24] B. Boenninghoff, R. M. Nickel, D. Kolossa, O2D2: Out-of-distribution detector to capture undecidable trials in authorship verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [25] Z. Peng, L. Kong, Z. Zhang, Z. Han, X. Sun, Encoding text information by pre-trained model for authorship verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [26] R. Futrzynski, Author classification as pre-training for pairwise authorship verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [27] D. Embarcadero-Ruiz, H. Gómez-Adorno, I. Reyes-Hernández, A. García, A. 
Embarcadero-Ruiz, Graph-based Siamese network for authorship verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [28] J. Tyo, B. Dhingra, Z. Lipton, Siamese Bert for authorship verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [29] M. Pinzhakova, T. Yagel, J. Rabinovits, Feature similarity-based regression models for authorship verification, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021. [30] D. Chicco, Siamese Neural Networks: An Overview, Springer US, New York, NY, 2021, pp. 73–94. URL: https://doi.org/10.1007/978-1-0716-0826-5_3. doi:10.1007/978-1-0716-0826-5_3. [31] E. W. Noreen, Computer-Intensive Methods for Testing Hypotheses: An Introduction, A Wiley-Interscience publication, 1989. [32] G. Barlas, E. Stamatatos, A transfer learning approach to cross-domain authorship attribution, Evolving Systems (2021). URL: https://link.springer.com/10.1007/s12530-021-09377-2. doi:10.1007/s12530-021-09377-2.