 Overview of the Cross-Domain Authorship Attribution
                  Task at PAN 2019

           Mike Kestemont,1 Efstathios Stamatatos,2 Enrique Manjavacas,1
              Walter Daelemans,1 Martin Potthast,3 and Benno Stein4
                              1 University of Antwerp, Belgium
                              2 University of the Aegean, Greece
                              3 Leipzig University, Germany
                              4 Bauhaus-Universität Weimar, Germany

                     pan@webis.de           https://pan.webis.de



       Abstract. Authorship identification remains a highly topical research problem
       in computational text analysis, with many relevant applications in contemporary
       society and industry. In this edition of PAN, we focus on authorship attribution,
       where the task is to attribute an unknown text to a previously seen candidate au-
       thor. As in the previous edition, we continue to work with fanfiction texts (in
       four Indo-European languages), written by non-professional authors in a cross-
       domain setting: the unknown texts belong to a different domain than the train-
       ing material that is available for the candidate authors. An important novelty of
       this year’s setup is the focus on open-set attribution, meaning that the test texts
       contain writing samples by previously unseen authors. For these, systems must
       consequently refrain from an attribution. We received a total of 12 submissions
       for this task, which we critically assess in this paper. We provide a detailed com-
       parison of these approaches, including three generic baselines.




1   Cross-Domain Authorship Attribution
Authorship attribution [1,2,3] continues to be an important problem in information re-
trieval and computational linguistics, and also in applied areas such as law and jour-
nalism where knowing the author of a document (such as a ransom note) may enable
law enforcement to save lives. The most common framework for testing candidate al-
gorithms is the closed-set attribution task: given a sample of reference documents from
a finite set of candidate authors, the task is to determine the most likely author of a pre-
viously unseen document of unknown authorship. This task is quite challenging under
cross-domain conditions where documents of known and unknown authorship come
from different domains (such as a different thematic area or genre). In addition, it is
often more realistic to assume that the true author of a disputed document is not neces-
sarily included in the list of candidates [4].
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons Li-
cense Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano,
Switzerland.
     This year, we again focus on the attribution task in the context of transformative
literature, more colloquially known as ‘fanfiction’. Fanfiction refers to a rapidly expand-
ing body of fictional narratives, typically produced by non-professional authors who
self-identify as ‘fans’ of a particular oeuvre or individual work [5]. Usually, these sto-
ries (or ‘fics’) are openly shared online with a larger fan community on platforms such
as fanfiction.net or archiveofourown.org. Interestingly, fanfiction is the
fastest-growing form of writing in the world today [6]. When sharing their texts,
fanfiction writers explicitly acknowledge taking inspiration from one or more cultural
domains that are known as ‘fandoms’. The resulting borrowings take place on various
levels, such as themes, settings, characters, story world, and also style. Fanfiction is
usually unofficial and unauthorized in nature [7], but because most fanfiction writers
do not have any commercial purposes, the genre falls under the principle of ‘Fair Use’
in many countries [8]. From the perspective of writing style, fanfiction offers valuable
benchmark data: the writings are unmediated and unedited before publication, which
means that they should accurately reflect an individual author’s writing style. More-
over, the rich metadata available for individual fics presents opportunities to quantify
the extent to which fanfiction writers have modeled their writing style after the original
author’s style [9].
     In the previous edition of PAN we also dealt with authorship attribution in fanfiction
and added extra difficulty with a cross-domain condition (i.e., different fandoms). This
year we have further increased the difficulty of the task by focusing on open-set attri-
bution conditions, meaning that the true author of a test text is not necessarily included
in the list of candidate authors. More formally, an open cross-domain authorship attri-
bution problem can be expressed as a tuple (A, K, U ), with A as the set of candidate
authors, K as the set of reference (known authorship) texts, and U as the set of unknown
authorship texts. For each candidate author a ∈ A, we are given Ka ⊂ K, a set of texts
unquestionably written by author a. Each text in U should either be assigned to exactly
one a ∈ A, or the system should refrain from an attribution if the target author is not
supposed to be in A. From a text categorization point of view, K is the training corpus
and U is the test corpus. Let DK be (the names of) the fandoms of the texts in K. Then,
all texts in U belong to a single fandom dU, with dU ∉ DK.
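     To make this setup concrete, the following minimal sketch (in Python, the language of the
provided baselines) represents a single problem as the tuple (A, K, U) and checks that an answer
respects the open-set constraint; the class and function names are illustrative and not part of
the official data format.

    from dataclasses import dataclass
    from typing import Dict, List

    UNK = "<UNK>"  # label used when a system refrains from an attribution

    @dataclass
    class AttributionProblem:
        candidates: List[str]         # A: the candidate author ids
        known: Dict[str, List[str]]   # K: author id -> texts of known authorship (K_a)
        unknown: List[str]            # U: texts of unknown authorship (one target fandom d_U)

    def is_valid_answer(problem: AttributionProblem, predictions: List[str]) -> bool:
        """Each test text receives exactly one label: a candidate from A or <UNK>."""
        if len(predictions) != len(problem.unknown):
            return False
        allowed = set(problem.candidates) | {UNK}
        return all(p in allowed for p in predictions)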


2   Datasets
This year’s shared task uses datasets in four major Indo-European languages: English,
French, Italian, and Spanish. For each language, 10 “problems” were constructed on
the basis of a larger dataset obtained from archiveofourown.org in 2017. Per
language, five problems were released as a development set to the participants in order
to calibrate their systems. The final evaluation of the submitted systems was carried
out on the five remaining problems that were not publicly released before the final re-
sults were communicated. Each problem had to be solved independently from the other
problems. It should be noted that the development material could not be used as mere
training material for supervised learning approaches since the candidate authors of the
development corpus and the evaluation corpus do not overlap. Therefore, approaches
should not be tailored to the candidate authors of the development corpus, but should
instead generalize well to other author sets.
    One problem corresponds to a single open-set attribution task, where we distinguish
between the “source” and the “target” material. The source material in each problem
contains exactly 7 training texts for exactly 9 candidate authors. In the target material,
these 9 authors are represented by at least one test text (possibly more). Additionally,
the target material also contains so-called “adversaries”, which were not written by one
of the candidate authors (indicated by the author label “<UNK>”). The proportion of
target texts written by the candidate authors, as opposed to <UNK> documents, was
varied across the problems in the development dataset in order to discourage systems
from opportunistic guessing.
    Let UK be the subset of U that includes all test documents actually written by the
candidate authors, and let UU be the subset of U containing the rest of the test docu-
ments not written by any candidate author. Then, the adversary ratio r = |UU |/|UK |
determines the likelihood of a test document to belong to one of the candidates. If r = 0
(or close to 0), the problem is essentially a closed-set attribution scenario since all test
documents belong to the candidate authors, or very few are actually written by adver-
saries. If r = 1, then it is equally probable for a test document to be written by a
candidate author or by another author. For r > 1 it is more likely for a test document to
be written by an adversary not included in the list of candidates.
    We examine cases where r ranges from 0.2 to 1.0. In greater detail, as can be seen
in Table 1, the development dataset comprises 5 problems per language that correspond
to r ∈ {0.2, 0.4, 0.6, 0.8, 1.0}. This dataset was released to the participants in order
to develop and calibrate their submissions. The final evaluation dataset also includes
5 problems per language but with fixed r = 1. Thus, the participants are implicitly
encouraged to develop generic approaches, because of the varying likelihood that a
test document is written by a candidate or an adversary. In addition, it is possible to
estimate the effectiveness of submitted methods when r < 1 by ignoring their answers
for specific subsets of UU in the evaluation dataset.
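     The following sketch illustrates how r and the reduced-ratio estimates can be computed
from a fixed evaluation set; it is a simplified illustration, and the organizers’ exact subset
selection for the r < 1 estimates may differ.

    import random

    UNK = "<UNK>"

    def adversary_ratio(gold_labels):
        """r = |U_U| / |U_K|: adversary documents versus documents by candidate authors."""
        n_adversary = sum(1 for y in gold_labels if y == UNK)
        return n_adversary / (len(gold_labels) - n_adversary)

    def subsample_adversaries(gold_labels, predictions, keep_fraction, seed=0):
        """Keep all candidate-authored test items and only a fraction of the adversary
        (<UNK>) items, simulating evaluation at a smaller adversary ratio."""
        rng = random.Random(seed)
        adversary_idx = [i for i, y in enumerate(gold_labels) if y == UNK]
        kept_adv = set(rng.sample(adversary_idx, int(len(adversary_idx) * keep_fraction)))
        keep_idx = [i for i, y in enumerate(gold_labels) if y != UNK or i in kept_adv]
        return [gold_labels[i] for i in keep_idx], [predictions[i] for i in keep_idx]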
    Each of the individual texts belongs to a single fandom, i.e., a certain topical do-
main. Fandoms were made available in the training material so that systems could ex-
ploit this information, as done by Seroussi et al. [10] for instance. We only selected
works of at least 500 tokens (according to the original database’s internal token
count), which is already a challenging text length for authorship analyses. Finally, we
normalized document length: for fics longer than 1 000 tokens, we only included the
middle 1 000 tokens of the text.
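     A minimal sketch of this length normalization, assuming simple whitespace tokenization
as a stand-in for the database’s internal token count:

    def middle_tokens(text, max_tokens=1000):
        """Keep only the middle max_tokens tokens of a fic that exceeds the limit."""
        tokens = text.split()  # whitespace tokenization; the original token count may differ
        if len(tokens) <= max_tokens:
            return text
        start = (len(tokens) - max_tokens) // 2
        return " ".join(tokens[start:start + max_tokens])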
    Another novelty this year was the inclusion of a set of 5 000 problem-external doc-
uments per language written by “imposter” authors (the authorship of these texts is
also encoded as <UNK>). These documents could be freely used by the participants
to develop their systems, for instance in the popular framework of imposter verifica-
tion [4]. The release of this additional corpus should be understood against the back-
drop of PAN’s benchmarking efforts to improve the reproducibility and comparability
of approaches. Many papers using imposter-based approaches do so on ad-hoc collected
corpora, making it hard to compare the effect of the composition of the imposter col-
lection. The imposter collection is given for the language as a whole and is thus not
problem-specific. These texts are guaranteed not to have been written by authors who
appear in the source or target sets of the problems for that language. When selecting
these texts from the base dataset, we gave preference to texts from the fandoms covered
in the problems, but when this selection yielded fewer than 5 000 texts, we completed it
with a random selection of other texts.

Table 1. Details about the fanfiction datasets built for the cross-domain authorship attribution
task. |A| refers to the size of the candidate list, |Ka| is the number of training documents per
author a ∈ A, |U| is the number of test documents, r is the adversary ratio, and l denotes the
average length of the documents in words.

                 Language    # Problems    |A|    |Ka|      |U|          r          l
  Development    English         5          9      7      137-561    0.2 - 1.0    804
                 French          5          9      7       38-430    0.2 - 1.0    790
                 Italian         5          9      7       46-196    0.2 - 1.0    814
                 Spanish         5          9      7      112-450    0.2 - 1.0    846

  Evaluation     English         5          9      7       98-180       1.0       817
                 French          5          9      7       48-290       1.0       790
                 Italian         5          9      7       34-302       1.0       821
                 Spanish         5          9      7      172-588       1.0       838

3     Evaluation Framework
In the final evaluation phase, submitted systems were presented with 5 problems per
language: for each problem, given a set of documents (known fanfics) by candidate
authors, the systems had to identify the authors of another set of documents (unknown
fanfics) in a previously unencountered fandom (target domain). Systems could assume
that each candidate author had contributed at least one of the unknown fanfics, all of
which belonged to the same target fandom. Some of the fanfics in the
target domain, however, were not written by any of the candidate authors. Like in the
calibration set, the known fanfics belonged to several fandoms (excluding the target
fandom), although not necessarily the same for all candidate authors. An equal number
of known fanfics per candidate author was provided: 7 fanfics for 9 authors. By contrast,
the unknown fanfics were not equally distributed over the authors.
    The submissions were separately evaluated in each attribution problem, based on
their open-set macro-averaged F1 score (calculated over the training classes, i.e., when
<UNK> is excluded) [11]. Participants were ranked according to their average open-set
macro-F1 across all attribution problems of the evaluation corpus. A reference imple-
mentation of the evaluation script was made available to the participants.
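     The sketch below approximates this measure with scikit-learn by restricting the macro-
average to the candidate classes (i.e., excluding <UNK>); it is an approximation of the
reference evaluation script, not the script itself, and assumes a recent scikit-learn version.

    from sklearn.metrics import precision_recall_fscore_support

    UNK = "<UNK>"

    def open_set_macro_scores(y_true, y_pred, candidate_authors):
        """Macro-averaged precision, recall, and F1 over the candidate classes only;
        documents wrongly left as <UNK> still hurt the recall of their true authors."""
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=sorted(candidate_authors),
            average="macro", zero_division=0)
        return p, r, f1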

3.1    Baseline Methods
As usual, we provide implementations of three baseline methods that give an estimate
of the overall difficulty of the problem given the state of the art in the field.
These implementations are in Python (2.7+) and rely on Scikit-learn and its base pack-
ages [12,13] as well as NLTK [14]. Participants were free to base their approach on one
of these reference systems, or to develop their own approach from scratch. The provided
baselines are as follows:
 1. BASELINE - SVM. A language-independent authorship attribution approach, framing
    attribution as a conventional text classification problem [15]. It is based on a char-
    acter 3-gram representation and a linear SVM classifier with a reject option. First,
    it estimates the probabilities of output classes based on Platt’s method [16]. Then,
    it assigns an unknown document to the <UNK> class when the difference between the
    probabilities of the top two candidates is less than a predefined threshold. Let a1
    and a2, a1, a2 ∈ A, be the two most likely authors of a certain test document, and let
    Pr1 and Pr2 be the corresponding estimated probabilities (i.e., all other candidates
    obtained lower probabilities). Then, if Pr1 − Pr2 < 0.1, the document is left
    unattributed; otherwise, it is attributed to a1 (a schematic sketch of this reject rule
    is given after this list).
 2. BASELINE - COMPRESSOR. A language-independent approach that uses text com-
    pression to estimate the distance of an unknown document to each of the candidate
    authors. This approach was originally proposed by [17] and was later reproduced
    by [18]. It uses the Prediction by Partial Matching (PPM) compression algorithm
    to build a model for each candidate author. Then, it calculates the cross-entropy
    of each test document with respect to the model of each candidate and assigns the
    document to the author with the lowest score. In order to adapt this method to the
    open-set classification scenario, we introduced a reject option. In more detail, a
    test document is left unattributed when the difference between the two most likely
    candidates is lower than a predefined threshold. Let a1 and a2, a1, a2 ∈ A, be the
    two most likely candidate authors for a certain test document, and let S1 and S2 be
    their cross-entropy scores (i.e., all other candidate authors have higher scores). If
    (S2 − S1)/S1 < 0.01, the test document is left unattributed; otherwise, it is
    assigned to a1.
 3. BASELINE - IMPOSTERS. Implementation of the language-independent imposters
    approach for authorship verification [4,19], based on character tetragram features.
    During a bootstrapped procedure, the technique iteratively compares an unknown
    text to each candidate author’s training profile, as well as to a set of imposter doc-
    uments, on the basis of a randomly selected feature subset. Then, the number of
    times the unknown document is found more similar to the candidate author’s docu-
    ments rather than to the imposters indicates how likely it is for that candidate to be
    the true author of the document. Instead of performing this procedure separately for
    each candidate author, we examine all candidate authors within each iteration (i.e.,
    in each iteration, a maximum of one candidate author’s score is increased). If after
    this repetitive process the highest score (corresponding to the most likely author)
    does not pass a fixed similarity threshold (here: 10% of repetitions), the document
    is assigned to the <UNK> class and is left unattributed. This baseline method is the
    only one that uses additional, problem-external imposter documents. We provided a
    collection of 5 000 imposter documents (fanfics on several fandoms) per language.
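     The reject rule used by BASELINE-SVM (and, in a relative form, by BASELINE-COMPRESSOR)
can be illustrated with the following sketch. This is a schematic re-implementation in
scikit-learn of the logic described in item 1, not the distributed baseline script; the
character 3-gram features, Platt-scaled probabilities, and the 0.1 threshold follow that
description.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import SVC

    UNK = "<UNK>"

    def svm_with_reject(train_texts, train_authors, test_texts, threshold=0.1):
        """Character 3-gram SVM with a reject option: if the two highest class
        probabilities differ by less than `threshold`, the document is left unattributed."""
        vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(test_texts)

        clf = SVC(kernel="linear", probability=True)  # probability=True enables Platt scaling
        clf.fit(X_train, train_authors)

        predictions = []
        for row in clf.predict_proba(X_test):
            top_two = np.sort(row)[-2:]                # second-best and best probabilities
            best_author = clf.classes_[np.argmax(row)]
            predictions.append(UNK if top_two[1] - top_two[0] < threshold else best_author)
        return predictions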
    Finally, we also compare the participating systems to a plain “majority” baseline:
through a simple voting procedure with random tie breaking, this baseline accepts a
candidate for a given unseen text if the majority of submitted methods agree on it;
otherwise, the <UNK> label is predicted. No meta-learning is applied to weigh the
importance of the votes of individual systems.
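     One plausible reading of this voting rule is sketched below; the exact majority criterion
and tie-breaking procedure used for the reported results may differ in detail.

    import random
    from collections import Counter

    UNK = "<UNK>"

    def majority_vote(system_predictions, rng=None):
        """system_predictions: one predicted label per submitted system for a single test text.
        A candidate is accepted only if a majority of the systems agree on it (ties between
        equally frequent labels are broken at random); otherwise <UNK> is returned."""
        rng = rng or random.Random(0)
        counts = Counter(system_predictions)
        top_count = max(counts.values())
        winner = rng.choice([label for label, c in counts.items() if c == top_count])
        if winner != UNK and top_count > len(system_predictions) / 2:
            return winner
        return UNK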


4   Survey of Submissions
In total, 12 methods were submitted to the task and evaluated using the TIRA exper-
imentation framework. All but one (Kipnis) of the submissions are described in the
participants’ notebook papers. Table 5 presents an overview of the main characteris-
tics of the submitted methods as well as the baselines. We also record whether ap-
proaches made use of the language-specific imposter material or language-specific NLP
resources, such as pretrained taggers and parsers. As can be seen, there is surprisingly
little variance in the approaches. The majority of submissions follow the paradigm of
the BASELINE-SVM or the winning approach [20] of the PAN-2018 cross-domain author-
ship attribution task [21], which is an ensemble of classifiers, each based on a
different text modality.
     Compared to the baselines, most submitted methods attempt to exploit richer in-
formation that corresponds to different text modalities as well as variable-length n-
grams [22] in contrast to fixed-length n-grams [23]. The most popular features are
n-grams extracted from plain text modalities, such as character, word, token, part-of-
speech tag, or syntactic level sequences. Given the cross-domain conditions of the task,
several participants attempted to use more abstract forms of textual information such
as punctuation sequences [24,22,25] or n-grams extracted from distorted versions [26]
of the original documents [27,22,25]. There is limited effort to enrich n-gram features
with alternative stylometric measures like word and sentence length distributions [28]
or features related to syntactic analysis of documents [24,29]. Only one participant
used word embeddings [30]. Other participants report that they were discouraged from using
more demanding types of word and sentence embeddings due to hardware limitations
of TIRA [31], which points to important infrastructural needs that may be addressed in
future editions. Those same teams, however, informally reported that the difference in
performance (in the development dataset) when using such additional features is negli-
gible.
     With respect to feature weighting, tf-idf is the most popular option while the base-
line methods are based on the simpler tf scheme. There is one attempt to use both of
these schemes [30]. A quite different approach uses a normalization scheme based on z-
scores [29]. In addition, a few methods apply dimension reduction (PCA, SVD) to the
features [32,33,22]. Judging from the results, such methods for dimension reduction
have the potential to boost performance.
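     As a generic illustration of this kind of feature pipeline (not any particular
participant’s system; the n-gram range and number of SVD components are illustrative choices):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    # tf-idf character n-grams followed by dimensionality reduction via truncated SVD
    features = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4), sublinear_tf=True),
        TruncatedSVD(n_components=100),
    )
    # X_train = features.fit_transform(train_texts)   # dense, low-dimensional document vectors
    # X_test = features.transform(test_texts)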
     As concerns the classifiers, the most popular choices are SVMs and ensembles of
classifiers, usually exploiting SVM base models followed by Logistic Regression (LR)
models. In a few cases, the participants informally report that they have experimented
with alternative classification algorithms (random forests, k-nn, naive Bayes) and found
that SVM and LR are the most effective classifiers for this kind of task [27,30]. None
of the participants’ methods is based on deep learning algorithms, most probably due
to hardware limitations of TIRA or because of the discouraging reported results in the
corresponding task of PAN-2018 [21].
     Given that the focus of the PAN-2019 edition of the task is on open-set attri-
bution, it is notable that none of the participants attempted to build a pure open-set
classifier [34]. Instead, they use closed-set classifiers with a reject option (the
classification prediction is dropped when the confidence of prediction is low), similar
to the baseline methods [35].
     A crucial issue to improve the performance of authorship attribution is the appro-
priate tuning of hyperparameters. Most of the participants tune the hyperparameters of
their approach globally based on the development dataset, that is, they estimate the most
suitable parameter values that are applied to any problem of the test dataset. In contrast,
a few participants attempt to tune the parameters of their method in a language-specific
way, estimating the most suitable values for each language separately. None of the sub-
mitted methods attempts to tune parameter values for each individual attribution prob-
lem.
     The submission of van Halteren focuses on the cross-domain difficulty of the task
and attempts to exploit the availability of multiple texts of unknown authorship in the
target domain within each attribution problem [29]. This submission follows a sophis-
ticated strategy composed of several phases. Initially, a typical cross-domain classifier
is built and each unknown document is assigned to its most likely candidate author, but
the prediction is kept only for the most confident cases. Then, a new in-domain clas-
sifier is built using the target domain documents (for which the predictions were kept
in the previous phase) and the remaining target domain documents are classified ac-
cordingly. However, this in-domain classifier can only be useful for certain candidate
authors, the ones with enough confident predictions in the initial phase. A final phase
combines the results of cross-domain and in-domain classifiers and leaves documents
with less confident predictions unattributed.


5   Evaluation Results

Table 2 shows an overview of the evaluation results of participants and their ranking
according to their macro-F1 (averaged across all attribution problems of the dataset).
As can be seen, all but one submission surpass the three baseline methods. In gen-
eral, the submitted methods and the baselines achieve better macro-recall than macro-
precision—which, interestingly, is not the case for the more prudent majority baseline.
The two top-performing submissions obtain a very similar macro-F1 score. However,
the winning approach of Muttenthaler et al. has better macro-precision while Bacciu et
al. achieve better macro-recall. In terms of elapsed runtime, the winning approach of
Muttenthaler et al. also proved to be a very efficient one.
    Table 3 demonstrates the effectiveness (averaged macro-F1) of the submitted meth-
ods for each one of the four languages of the evaluation dataset. The winning approach
of Muttenthaler et al. is more effective in English and French while the approach of
Bacciu et al. achieves comparable performance in Italian and Spanish. In general, the
variation of top-performing approaches across the four languages is low. On average,
the highest performance is obtained for attribution problems in Italian; English proved
Table 2. The final evaluation results of the cross-domain authorship attribution task. Participants
and baselines are ranked according to macro-F1.

       Submission              Macro-Precision Macro-Recall Macro-F1 Runtime
       Muttenthaler et al.               0.716          0.742          0.690      00:33:17
       MAJORITY                          0.748          0.708          0.686
       Bacciu et al.                     0.688          0.768          0.680      01:06:08
        Custodio & Paraboni               0.664          0.717          0.650      01:21:13
       Bartelds & de Vries               0.657          0.719          0.644      11:19:32
       Rodríguez et al.                  0.651          0.713          0.642      01:59:17
       Isbister                          0.629          0.706          0.622      01:05:32
       Johansson                         0.593          0.734          0.616      01:05:30
       Basile                            0.616          0.692          0.613      00:17:08
       van Halteren                      0.590          0.734          0.598      37:05:47
       Rahgouy et al.                    0.601          0.633          0.580      02:52:03
       Gagala                            0.689          0.593          0.576      08:22:33
       BASELINE-SVM                      0.552          0.635          0.545
       BASELINE-COMPRESSOR               0.561          0.629          0.533
       BASELINE-IMPOSTERS                0.428          0.580          0.395
       Kipnis                            0.270          0.409          0.259      20:20:21


to be the most difficult case. It is also remarkable that the baseline-compressor method
achieves the best baseline results for English, French, and Italian, but it is not as com-
petitive in Spanish. Furthermore, note that Muttenthaler et al.’s submission is the only
one to outperform the otherwise very competitive majority baseline, albeit by a very
small margin. The latter reaches a relatively high precision, but must sacrifice quite a
bit of recall in return. That the winner outperforms the majority baseline is surprising:
in previous editions of this shared task (e.g. [36]), similar meta-level approaches proved
very hard to beat. This result is probably an artifact of the lack of diversity among the
submissions in the top-scoring cohort, which seem to have produced very similar pre-
dictions (see below), thus reducing the beneficial effects of a majority vote among those
systems.
    In order to examine the effectiveness of the submitted methods for a varying adver-
sary ratio, we performed the following additional evaluation process. As can be seen in
Table 1, all attribution problems of the evaluation dataset have a fixed adversary ratio
r = 1, meaning that an equal number of documents written either by the candidate au-
thors or adversary authors is included in the test set of each problem. Once the submitted
methods processed the whole evaluation dataset, we calculated the evaluation measures
with decreasing proportions of adversary documents at 100%, 80%, 60%, 40%, and
20%, resulting in an adversary ratio that ranges from 1 to 0.2. Table 4 presents the
evaluation results (averaged macro-F1) for such a varying adversary ratio. In general,
the performance of all methods increases when the adversary ratio decreases. Recall
that r = 0 corresponds to a closed-set attribution case. The performance of the two
top-performing approaches is very similar in the whole range of examined r-values.
However, the method of Muttenthaler et al. is slightly better for high r-values while
Bacciu et al. is slightly better for low r-values.
Table 3. Results (macro-F1) per language of the cross-domain authorship attribution task. Partic-
ipants and baselines are ranked according to their overall macro-F1.

         Submission                        English     French      Italian     Spanish
         Muttenthaler et al.                0.665       0.705       0.717       0.673
         Bacciu et al.                      0.638       0.689       0.715       0.679
         Custodio & Paraboni                0.587       0.686       0.682       0.647
         Bartelds & de Vries                0.558       0.687       0.700       0.629
         Rodríguez et al.                   0.597       0.624       0.696       0.651
         Isbister                           0.529       0.644       0.691       0.623
         Johansson                          0.613       0.593       0.655       0.602
         Basile                             0.555       0.628       0.656       0.613
         van Halteren                       0.532       0.554       0.653       0.652
         Rahgouy et al.                     0.550       0.583       0.595       0.592
         Gagala                             0.554       0.564       0.566       0.619
         BASELINE-SVM                       0.490       0.548       0.566       0.577
         BASELINE-COMPRESSOR                0.493       0.595       0.580       0.464
         BASELINE-IMPOSTERS                 0.359       0.409       0.410       0.400
         Kipnis                             0.301       0.232       0.285       0.220


    Moreover, we have applied statistical significance tests to the systems’ output. Espe-
cially since many systems have adopted a similar approach, it is worthwhile to discuss
the extent to which submissions show statistically meaningful differences. Like in pre-
vious editions, we have applied approximate randomization testing, a non-parametric
procedure that accounts for the fact that we should not make too many assumptions as
to the underlying distributions for the classification labels. Table 6 lists the results for
pairwise tests, comparing all submitted approaches to each other, based on their respec-
tive F1-scores for all labels in the problems. Over 1 000 randomization iterations, the test
returns probabilities that can be interpreted as conventional p-values of one-sided statistical
tests, i.e., the probability, under the null hypothesis (H0) that the classifiers’ outputs do not
differ, of observing a difference at least as large as the actual one. The symbolic notation uses
the following thresholds: ‘=’ (not significantly different: p ≥ 0.05), ‘*’ (significantly different:
p < 0.05), ‘**’ (very significantly different: p < 0.01), ‘***’ (highly significantly different:
p < 0.001). Interestingly, systems with
neighboring ranks often do not yield significantly different scores; this is also true for
the two top-performing systems. Note that almost all systems have produced an output
that is significantly different from the three baselines (which also display a high degree
of difference among one another). According to this test, the difference between Mut-
tenthaler et al. and Bacciu et al. is not statistically significant, although the former is
significantly different from the majority baseline.
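     For reference, a minimal sketch of approximate randomization over paired scores (e.g., the
per-class F1 values of two systems on the same problems); the pairing, the use of absolute
differences, and the number of iterations are illustrative and may differ from the exact test
configuration used here.

    import random

    def approximate_randomization(scores_a, scores_b, iterations=1000, seed=0):
        """Approximate randomization test on paired score lists: returns the estimated
        probability, under random swaps of the paired scores, of observing a mean
        difference at least as large as the one actually measured."""
        rng = random.Random(seed)
        n = len(scores_a)
        observed = abs(sum(scores_a) - sum(scores_b)) / n
        extreme = 0
        for _ in range(iterations):
            perm_a, perm_b = [], []
            for a, b in zip(scores_a, scores_b):
                if rng.random() < 0.5:      # swap the paired scores with probability 0.5
                    a, b = b, a
                perm_a.append(a)
                perm_b.append(b)
            if abs(sum(perm_a) - sum(perm_b)) / n >= observed:
                extreme += 1
        return (extreme + 1) / (iterations + 1)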
Table 4. Evaluation results (macro-F1) of the cross-domain authorship attribution task for differ-
ent values of the adversary ratio r. Participants and baselines are ranked according to their overall
macro-F1.

                                                                  r
        Submission                         100%        80%       60%        40%        20%
        Muttenthaler et al.                 0.690      0.709     0.727      0.746     0.773
        Bacciu et al.                       0.680      0.701     0.724      0.749     0.777
        Custodio & Paraboni                 0.650      0.666     0.686      0.704     0.728
        Bartelds & de Vries                 0.644      0.663     0.683      0.708     0.736
        Rodríguez et al.                    0.642      0.664     0.684      0.704     0.733
        Isbister                            0.622      0.642     0.664      0.685     0.716
        Johansson                           0.616      0.641     0.666      0.700     0.735
        Basile                              0.613      0.633     0.654      0.675     0.706
        van Halteren                        0.598      0.622     0.645      0.672     0.701
        Rahgouy et al.                      0.580      0.599     0.619      0.642     0.664
        Gagala                              0.576      0.586     0.597      0.610     0.624
        BASELINE-SVM                        0.545      0.563     0.585      0.611     0.642
        BASELINE-COMPRESSOR                 0.533      0.548     0.569      0.592     0.620
        BASELINE-IMPOSTERS                  0.395      0.409     0.429      0.453     0.484
        Kipnis                              0.259      0.270     0.285      0.302     0.324
Table 5. Comparison of the core components of the submitted systems.

Participant            Features                                         Weighting      Feature     Parameter     Classifier                         Open-set      Imposters  Language-dependent
                                                                                       transform.  tuning                                           criterion     data       resources
Bacciu et al.          n-grams (char, word, POS, stem, distortion)      tf-idf         NA          per language  Ensemble (SVM)                     Reject        No         stemming, POS
Bartelds and de Vries  n-grams (char, token, POS, punctuation, ...)     tf-idf         NA          global        SVM                                Reject        No         POS, syntactic parse
Basile                 n-grams (char and word)                          tf-idf         NA          global        SVM                                Reject        No         None
Custodio et al.        n-grams (char, word, distortion), ...            tf-idf         PCA         global        Ensemble (LR)                      Reject        No         None
Gagala                 n-grams (char, word)                             tf-idf         PCA         global        Imposters (LR)                     Verification  Yes        None
Isbister               n-grams (char, word), word and sentence ...      tf-idf         NA          global        SVM                                Reject        No         None
Johansson              n-grams (char, word, POS, distortion), word ...  tf-idf         NA          global        SVM                                Reject        No         POS
Muttenthaler et al.    n-grams (char, word, distortion, punctuation)    tf-idf         SVD         global        Ensemble (SVM)                     Reject        No         None
Rahgouy et al.         n-grams (char and word), word embeddings         tf-idf and tf  NA          global        Ensemble (SVM)                     Reject        No         stemming
Rodríguez et al.       n-grams (char, typed, punctuation, word)         tf-idf         NA          global        Ensemble (SVM)                     Reject        No         None
Van Halteren           n-grams (char, token, syntactic)                 z-score        NA          per language  Ensemble (distance-based and SVR)  Reject        No         POS, syntactic parse
baseline-SVM           n-grams (char)                                   tf             NA          global        SVM                                Reject        No         None
baseline-Compressor    char sequences                                   none           NA          global        PPM                                Reject        No         None
baseline-Imposters     n-grams (char)                                   tf             NA          global        distance-based                     Verification  Yes        None
Table 6. Significance of pairwise differences in output between submissions and across all problems.
Column abbreviations (in ranking order): Maj = Majority, Bac = Bacciu et al., C&P = Custodio & Paraboni,
BdV = Bartelds & de Vries, Rod = Rodríguez et al., Isb = Isbister, Joh = Johansson, Bas = Basile,
vanH = Van Halteren, Rah = Rahgouy et al., Gag = Gagala, Bsvm = Baseline-svm, Bcmp = Baseline-compressor,
Bimp = Baseline-impostors, Kip = Kipnis.

                      Maj   Bac   C&P   BdV   Rod   Isb   Joh   Bas   vanH  Rah   Gag   Bsvm  Bcmp  Bimp  Kip
Muttenthaler et al.   **    =     ***   ***   ***   ***   ***   ***   ***   ***   **    ***   ***   ***   ***
Majority                    ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***
Bacciu et al.                     =     ***   ***   ***   ***   ***   ***   ***   *     ***   ***   ***   ***
Custodio & Paraboni                     =     =     **    ***   ***   ***   ***   =     ***   ***   ***   ***
Bartelds & de Vries                           =     =     ***   =     **    *     =     ***   ***   ***   ***
Rodríguez et al.                                    *     ***   ***   ***   **    =     ***   ***   ***   ***
Isbister                                                  ***   =     =     =     =     ***   ***   ***   ***
Johansson                                                       **    =     **    ***   =     ***   ***   ***
Basile                                                                =     =     ***   ***   ***   ***   ***
Van Halteren                                                                =     ***   ***   ***   ***   ***
Rahgouy et al.                                                                    **    ***   ***   ***   ***
Gagala                                                                                  ***   ***   ***   ***
Baseline-svm                                                                                  ***   ***   ***
Baseline-compressor                                                                                 ***   ***
Baseline-impostors                                                                                        ***
6   Conclusion

The paper discussed the 12 submissions to the 2019 edition of the PAN shared task on
authorship identification. Like last year, we focused on cross-domain attribution in fan-
fiction data. An important innovation this year was the focus on the open-set attribution
set-up, where participating systems had to be able to refrain from attributing unseen
texts as well. The analyses described above call for a number of considerations that are
not without relevance to future development in the field of computational authorship
identification. First of all, this year’s edition was characterized by a relatively low de-
gree of diversity in approaches: especially the higher-scoring cohort almost exclusively
adopted a highly similar approach, involving a combination of SVMs as classifier (po-
tentially as part of an ensemble), character n-grams as features, and a rather simple
thresholding mechanism to refrain from attributions. It is not immediately clear which
directions future research might explore. Deep learning-based methods, which can be
pretrained on external corpora, have so far not led to a major breakthrough in the field,
despite the impressive improvements which have been reported for these methods in
other areas of NLP. A more promising research direction might be to move away
from closed-set classifiers (with a naive reject option) towards purely open-set classi-
fiers [34].


References
 1. M. Koppel, J. Schler, and S. Argamon. Computational methods in authorship attribution.
    Journal of the American Society for Information Science and Technology, 60(1):9–26, 2009.
 2. P. Juola. Authorship attribution. Foundations and Trends in Information Retrieval,
    1(3):233–334, 2006.
 3. Efstathios Stamatatos. A survey of modern authorship attribution methods. JASIST,
    60(3):538–556, 2009.
 4. Moshe Koppel and Yaron Winter. Determining if two documents are written by the same
    author. Journal of the Association for Information Science and Technology, 65(1):178–187,
    2014.
 5. Karen Hellekson and Kristina Busse, editors. The Fan Fiction Studies Reader. University
    of Iowa Press, 2014.
 6. K. Mirmohamadi. The Digital Afterlives of Jane Austen. Janeites at the Keyboard. Palgrave
    MacMillan, 2014.
 7. J. Fathallah. Fanfiction and the Author. How FanFic Changes Popular Cultural Texts.
    Amsterdam University Press, 2017.
 8. R. Tushnet. Legal fictions: Copyright, fan fiction, and a new common law. Loyola of Los
    Angeles Entertainment Law Review, 17(3), 1997.
 9. Efstathios Stamatatos, Francisco M. Rangel Pardo, Michael Tschuggnall, Benno Stein,
    Mike Kestemont, Paolo Rosso, and Martin Potthast. Overview of PAN 2018 - author
    identification, author profiling, and author obfuscation. In Experimental IR Meets
    Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF
    Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings, pages
    267–285, 2018.
10. Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. Authorship attribution with topic
    models. Computational Linguistics, 40(2):269–310, 2014.
11. Pedro R. Mendes Júnior, Roberto M. de Souza, Rafael de O. Werneck, Bernardo V. Stein,
    Daniel V. Pazinato, Waldir R. de Almeida, Otávio A. B. Penatti, Ricardo da S. Torres, and
    Anderson Rocha. Nearest neighbors distance ratio open-set classifier. Machine Learning,
    106(3):359–386, Mar 2017.
12. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
    P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
    M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
    Journal of Machine Learning Research, 12:2825–2830, 2011.
13. Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing, 2006.
14. Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python.
    O’Reilly Media, 2009.
15. F. Sebastiani. Machine Learning in automated text categorization. ACM Computing
    Surveys, 34(1):1–47, 2002.
16. J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized
    likelihood methods. In A.J. Smola, P. Bartlett, B. Schoelkopf, and D. Schuurmans, editors,
    Advances in Large Margin Classifiers, pages 61–74, 2000.
17. William J. Teahan and David J. Harper. Using Compression-Based Language Models for
    Text Categorization, pages 141–165. Springer Netherlands, Dordrecht, 2003.
18. Martin Potthast, Sarah Braun, Tolga Buz, Fabian Duffhauss, Florian Friedrich, Jörg Marvin
    Gülzow, Jakob Köhler, Winfried Lötzsch, Fabian Müller, Maike Elisa Müller, Robert
    Paßmann, Bernhard Reinke, Lucas Rettenmeier, Thomas Rometsch, Timo Sommer,
    Michael Träger, Sebastian Wilhelm, Benno Stein, Efstathios Stamatatos, and Matthias
    Hagen. Who wrote the web? Revisiting influential author identification research applicable
    to information retrieval. In Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane
    Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello,
    editors, Proc. of the European Conference on Information Retrieval, pages 393–407.
    Springer International Publishing, 2016.
19. Mike Kestemont, Justin Anthony Stover, Moshe Koppel, Folgert Karsdorp, and Walter
    Daelemans. Authenticating the writings of Julius Caesar. Expert Systems with Applications,
    63:86–96, 2016.
20. José Eleandro Custódio and Ivandré Paraboni. EACH-USP Ensemble cross-domain
    authorship attribution. In Linda Cappellato, Nicola Ferro, Jian-Yun Nie, and Laure Soulier,
    editors, Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum,
    CEUR Workshop Proceedings. CLEF and CEUR-WS.org, 2018.
21. Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther
    Specht, Benno Stein, and Martin Potthast. Overview of the author identification task at
    PAN-2018: cross-domain authorship attribution and style change detection. In Working
    Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CEUR-WS.org, 2018.
22. Lukas Muttenthaler, Gordon Lucas, and Janek Amann. Authorship Attribution in
    Fan-fictional Texts Given Variable Length Character and Word n-grams. In Linda
    Cappellato, Nicola Ferro, David E. Losada, and Henning Müller, editors, CLEF 2019 Labs
    and Workshops, Notebook Papers. CEUR-WS.org, 2019.
23. Angelo Basile. An Open-Vocabulary Approach to Authorship Attribution. In Linda
    Cappellato, Nicola Ferro, David E. Losada, and Henning Müller, editors, CLEF 2019 Labs
    and Workshops, Notebook Papers. CEUR-WS.org, 2019.
24. Martijn Bartelds and Wietse de Vries. Improving Cross-Domain Authorship Attribution by
    Combining Lexical and Syntactic Features. In Linda Cappellato, Nicola Ferro, David E.
    Losada, and Henning Müller, editors, CLEF 2019 Labs and Workshops, Notebook Papers.
    CEUR-WS.org, 2019.
25. Carolina Martín del Campo Rodríguez, Daniel Alejandro Pérez Álvarez, Christian
    Efraín Maldonado Sifuentes, Grigori Sidorov, Ildar Batyrshin, and Alexander Gelbukh.
    Authorship Attribution through Punctuation n-grams and Averaged Combination of SVM.
    In Linda Cappellato, Nicola Ferro, David E. Losada, and Henning Müller, editors, CLEF
    2019 Labs and Workshops, Notebook Papers. CEUR-WS.org, 2019.
26. Efstathios Stamatatos. Authorship attribution using text distortion. In Proceedings of the
    15th Conference of the European Chapter of the Association for Computational Linguistics:
    Volume 1, Long Papers, pages 1138–1149. Association for Computational Linguistics,
    2017.
27. Andrea Bacciu, Massimo La Morgia, Alessandro Mei, Eugenio Nerio Nemmi, Valerio Neri,
    and Julinda Stefa. Cross-domain Authorship Attribution Combining Instance Based and
    Profile Based Features. In Linda Cappellato, Nicola Ferro, David E. Losada, and Henning
    Müller, editors, CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org, 2019.
28. Fredrik Johansson and Tim Isbister. FOI Cross-Domain Authorship Attribution for
    Criminal Investigations. In Linda Cappellato, Nicola Ferro, David E. Losada, and Henning
    Müller, editors, CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org, 2019.
29. Hans van Halteren. Cross-Domain Authorship Attribution with Federales. In Linda
    Cappellato, Nicola Ferro, David E. Losada, and Henning Müller, editors, CLEF 2019 Labs
    and Workshops, Notebook Papers. CEUR-WS.org, 2019.
30. Mostafa Rahgouy, Hamed Babaei Giglou, Taher Rahgooy, Mohammad Karami Sheykhlan,
    and Erfan Mohammadzadeh. Cross-domain Authorship Attribution: Author Identification
    using a Multi-Aspect Ensemble Approach. In Linda Cappellato, Nicola Ferro, David E.
    Losada, and Henning Müller, editors, CLEF 2019 Labs and Workshops, Notebook Papers.
    CEUR-WS.org, 2019.
31. Martin Potthast, Tim Gollub, Matti Wiegmann, and Benno Stein. TIRA Integrated Research
    Architecture. In Nicola Ferro and Carol Peters, editors, Information Retrieval Evaluation in
    a Changing World - Lessons Learned from 20 Years of CLEF. Springer, 2019.
32. José Custódio and Ivandre Paraboni. Multi-channel Open-set Cross-domain Authorship
    Attribution. In Linda Cappellato, Nicola Ferro, David E. Losada, and Henning Müller,
    editors, CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org, 2019.
33. Lukasz Gagala. Authorship Attribution with Logistic Regression and Imposters. In Linda
    Cappellato, Nicola Ferro, David E. Losada, and Henning Müller, editors, CLEF 2019 Labs
    and Workshops, Notebook Papers. CEUR-WS.org, 2019.
34. Walter Scheirer, Anderson Rocha, Archana Sapkota, and Terrance Boult. Toward open set
    recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
    35(7):1757–1772, 2013.
35. Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. Recent advances in open set
    recognition: A survey. arXiv preprint arXiv:1811.08581, 2018.
36. Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Benno Stein, Martin Potthast,
    Patrick Juola, Miguel A. Sánchez-Pérez, and Alberto Barrón-Cedeño. Overview of the
    author identification task at PAN 2014. In CLEF 2014 Evaluation Labs and Workshop –
    Working Notes Papers, Sheffield, UK, September 2014.