=Paper=
{{Paper
|id=Vol-1866/invited_paper_3
|storemode=property
|title=Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering
|pdfUrl=https://ceur-ws.org/Vol-1866/invited_paper_3.pdf
|volume=Vol-1866
|authors=Michael Tschuggnall,Efstathios Stamatatos,Ben Verhoeven,Walter Daelemans,Günther Specht,Benno Stein,Martin Potthast
|dblpUrl=https://dblp.org/rec/conf/clef/TschuggnallSVDS17
}}
==Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering==
Michael Tschuggnall (1), Efstathios Stamatatos (2), Ben Verhoeven (3),
Walter Daelemans (3), Günther Specht (1), Benno Stein (4), and Martin Potthast (4)

(1) University of Innsbruck, Austria
(2) University of the Aegean, Greece
(3) University of Antwerp, Belgium
(4) Bauhaus-Universität Weimar, Germany

pan@webis.de    http://pan.webis.de
Abstract Several authorship analysis tasks require the decomposition of a multi-
authored text into its authorial components. In this regard two basic prerequisites
need to be addressed: (1) style breach detection, i.e., the segmenting of a text
into stylistically homogeneous parts, and (2) author clustering, i.e., the grouping
of paragraph-length texts by authorship. In the current edition of PAN we focus
on these two unsupervised authorship analysis tasks and provide both benchmark
data and an evaluation framework to compare different approaches. We received
three submissions for the style breach detection task and six submissions for the
author clustering task; we analyze the submissions, compare them to different baselines,
and highlight their strengths and weaknesses.
1 Introduction
An authorship analysis extracts information about the authors of given documents.
There are several related supervised tasks where a set of documents with known information
about their authors is available and can be used to train a model that extracts this
information from other documents. Typical examples are authorship attribution (extracting
the identity of authors) [47] and author profiling (extracting demographics such as age
and gender of the authors) [40]. The vast majority of published work focuses on these
two tasks. However, there are cases where authorship-related information
in a set of training documents is neither available nor reliable. Examples of unsuper-
vised tasks are intrinsic plagiarism detection (identification of plagiarized parts within
a given document without a reference collection of authentic documents) [51], author
clustering (grouping documents by authorship) [32, 44], and author diarization (decom-
posing a multi-authored document into authorial components) [2, 12, 29]. Unsupervised
authorship analysis tasks are more challenging but can be applied to every authorship
analysis case since they do not require any training material.
Previous editions of PAN focused on specific unsupervised tasks such as author
clustering or author diarization [50]. However, it has been observed that it was very
difficult for the submitted approaches to surpass even naive baseline methods. Given
the complexity of unsupervised tasks, it is essential to focus on fundamental problems
and to study them separately. In the current edition of PAN, we focus on two such
fundamental problems:
1. Segmentation of a multi-authored document into stylistically homogeneous parts.
We call this task style breach detection.
2. Grouping of paragraph-length document parts by authorship. We call this task au-
thor clustering.
These two tasks are elementary processing steps for both author diarization and in-
trinsic plagiarism detection. Style breach detection could also be useful in writing style
checkers, where it can help to ensure that stylistic properties are homogeneous within
a document. Moreover, author clustering of short (paragraph-length) documents
could be useful in the analysis of social media texts such as blog posts, comments, and
reviews. For example, author clustering could help to identify different user names that
correspond to the same person or user accounts that are used by multiple persons.
In this paper we present an overview of the shared tasks in style breach detection
and author clustering at PAN-2017. We received three submissions for the former and
six submissions for the latter task. The evaluation framework including benchmark data,
evaluation measures, and baseline methods is described. In addition, we present an anal-
ysis and a survey of the submitted methods.
2 Previous Work
This section reviews related work on style breach detection and author clustering.
2.1 Style Breach Detection
The goal is to find positions within a document where the authorship changes, i.e.,
where the style changes. Thus, it is closely related to all fields within stylometry, es-
pecially intrinsic plagiarism detection [52]. Several approaches exist that deal with the
latter, basically by creating stylistic fingerprints that include lexical features such as
character n-grams [30, 48], word frequencies [21], or average word/sentence lengths
[57]; syntactic features such as part-of-speech (POS) tag frequencies/structures [54];
and structural features such as average paragraph lengths or indentation usage [57]. Using
these fingerprints, outliers are sought, either by applying different distance metrics on
sliding windows [48] or by storing distance matrices [53, 24].
In contrast, related work targeting multi-author documents is rare. One of the first
approaches that uses stylometry to automatically detect authorship boundaries in
collaboratively written text was proposed by Glover and Hirst [15], with the aim of
providing hints towards producing a homogeneously written text. Graham et al. [18] utilize
neural networks with several stylometric features, and Giannella [12] proposes a stochastic
model on the occurrences of words to split a document by authorship. An unsupervised
decomposition of multi-author documents based on grammar features has been evalu-
ated by Tschuggnall and Specht [56].
The diarization task at PAN-2016 [49] dealt with building author clusters within
documents. Of the two submitted approaches, one uses n-grams and other selected
stylometric features in combination with a classifier post-processed by a Hidden Markov
Model [31], while the other feeds a sentence-based distance metric, computed from several
features, to a k-means algorithm in order to build clusters [46].
From a global point of view, style breach detection can also be seen as a text
segmentation problem where a document is split into segments based on the writing
style. Common text segmentation approaches divide a text by different topics and/or
genres [6]. Compared to an intrinsic stylometric analysis, those approaches have the
advantage of being able to build dictionaries or other useful statistics for each targeted
topic or genre in advance. A wide range of methods is used, often based on the
research by Hearst [20], in which the lexical cohesion of terms is analyzed. Other ap-
proaches use Bayesian models [10], Hidden Markov Models [5], or vocabulary analysis in
various forms, such as word stem repetitions [36] or word frequency models [41]. While
some of the recent papers [42, 34] compare the segmentation approaches on the same
data sets, it is in general difficult to compare performances due to the heterogeneous
problems and data types.
2.2 Author Clustering
Previous work on author clustering (also called author-based clustering, authorship
clustering, or authorial clustering), as it is defined in this paper, is limited. Iqbal et
al. [22] describe an approach based on k-means clustering which requires that the num-
ber of authors is known, and apply it to a collection of e-mail messages. Layton et al.
[32] propose a method that can automatically estimate the number of clusters (authors)
in a collection of documents using the iterative positive Silhouette method. The latter
has been demonstrated to be useful for clustering validation purposes [33]. These tech-
niques have been applied to literary texts (either books or book samples). Samdani et al.
[44] analyze postings in a discussion forum using an online clustering method. Daks &
Clark [8] use POS n-grams and spectral clustering and test their method on a variety
of corpora including newspaper articles, political speeches, and literary texts.
A shared task on author clustering was included in PAN-2016 [50]. The
benchmark collections built for this task comprised texts in three languages (English,
Dutch, and Greek) and two genres (opinion articles and reviews) taken from various
sources. A total of eight submissions was received and the best-performing model was
based on a successful authorship verification method using a character-level multi-
headed recurrent neural network [4], closely followed by a simple approach based on
word and punctuation mark frequencies [27]. Both of these methods first compute pair-
wise distances between texts and then form clusters by joining texts that belong to a
path of small distances. In general, simple baseline methods, such as placing each text
in a distinct cluster, were found to be very competitive, since in the benchmark collections
the proportion of single-item clusters was high (>50%) or very high (>75%) [50].
3 Style Breach Detection
As a specific type of author identification, the style breach detection task at PAN-2017
focuses on finding stylistic differences within a text document as a result of having
multiple authors collaborating on it. The main goal is to identify style breaches, i.e.,
exact positions in the text where the authorship changes. No training data is available
for the authors involved, nor can such information be gained from web searches. From
this perspective, this year's task continues a series of subtasks of previous PAN events
that focused on intrinsic characteristics of text documents, including intrinsic plagiarism
detection [37] and author diarization [49]; their main commonality is that the style of
authors has to be extracted and quantified in some way in order to tackle these problems.
In a similar way, an intrinsic analysis of the writing style is key to approaching the
PAN-2017 style breach detection task, which can be summarized as follows:
Given a document, determine whether it is multi-authored,
and, if so, find the borders where authorship switches.
In contrast to the author clustering task described in Section 4, the goal is to find
only the borders; identifying or clustering the authors of segments is not required.
The detection of style breaches, i.e., locating borders between different authors, can
be seen as an application of a general text segmentation problem. Nevertheless, a sig-
nificant difference to existing segmentation approaches is that while the latter usually
focus on detecting switches of topics or stories [20, 42, 34], the aim of this subtask is
to identify borders based on the writing style, disregarding the specific content. While
segmentation algorithms may include metrics built from precomputed dictionaries covering
different topics or genres, an additional difficulty here results from the fact that the
content type is not known in advance and, more importantly, is coherent throughout a document.
3.1 Approaches at PAN-2017
This year, five teams registered for the style breach detection task, of which three
actively submitted their software to TIRA [16, 38]. In the following, a short summary
of each approach is given. Moreover, the creation of two slightly different baseline
approaches for comparison is explained.
– Karaś, Śpiewak & Sobecki [9]. The authors start by splitting the document into
paragraphs, either by detecting multiple newline characters, or, if there are no new-
lines, by choosing a fixed number of sentences. Subsequently, it is assumed
that style breaches may occur only at paragraph endings. To quantify the style of
each paragraph, tf-idf matrices are computed using single words, word 3-grams,
stop words, POS tags, and punctuation characters. All tf-idf matrices are concatenated,
and a paired-samples Wilcoxon signed-rank test [13] is applied, a statistical
method to verify whether two given samples stem from the same distribution.
Computing this test for all pairs of consecutive paragraphs yields the prediction:
a style breach is predicted wherever the test suggests that the paragraphs come
from different distributions (a minimal sketch of this test is given after this list).
– Khan [25]. The author also segments the document into sentences in a preprocessing
step. Then, the sentences are traversed using two sliding windows that
share one sentence in the middle. For each window, several statistics are computed,
including the most frequent POS tags, non-/alphanumeric characters, and words.
Moreover, word statistics based on precomputed dictionaries are utilized, which include
common English words and several sentiment dictionaries. Using all metrics, a
similarity score between the two adjacent sliding windows is calculated, which is
finally compared to a predefined threshold in order to decide whether the position
before/after the overlapping sentence is predicted as a style breach. If no breach
is predicted, the two sliding windows are merged and considered to be written by a single
author; a new second window is then created and processed as described earlier.
– Safin & Kuznetsova [43]. The authors approach the style breach detection task by
applying a sentence outlier detection, commonly used in intrinsic plagiarism de-
tection algorithms [48, 55]. After splitting the document into sentences, each one
is vectorized using two pretrained skip-thought models [26]. These models can be
seen as word embeddings operating on sentences as the atomic units, thereby re-
sulting in 2,400 dimensions for each sentence. The distance between each distinct
pair of sentences is stored in a distance matrix (similar to, e.g., [53, 24]) by calcu-
lating the cosine distance of the corresponding vectors. Finally, an outlier detection
is performed using an optimized threshold, which is compared to the average dis-
tance of each sentence. If the distance of a sentence s is larger than the threshold,
the beginning of s is marked as a style breach position.
– Baselines. To be able to compare the performance of the submitted approaches,
two simple baselines have been computed (minimal sketches of both follow below):
1. BASELINE-rnd randomly places between 0 and 10 borders at arbitrary positions
inside a document.
2. As a slight variant, BASELINE-eq also decides on a random basis how many
borders should be placed (also between 0 and 10), but then places the borders
uniformly, i.e., such that all resulting segments contain an equal number of tokens.
Both baseline results are averages over 100 runs.
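To make the core of the winning approach concrete, the following is a minimal sketch, not the authors' code, of the paragraph-pair test referenced in the first list item above: consecutive paragraphs are represented as tf-idf vectors, and a paired Wilcoxon signed-rank test decides whether they stem from the same distribution. The restriction to word unigrams and the 0.05 significance threshold are illustrative assumptions.

```python
# Sketch of the paragraph-pair test of Karaś et al.; word unigrams stand in
# for the full feature set (words, word 3-grams, stop words, POS tags,
# punctuation), and alpha=0.05 is an assumed threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.stats import wilcoxon

def predict_breaches(paragraphs, alpha=0.05):
    """Return indices i such that a style breach is predicted after paragraph i."""
    tfidf = TfidfVectorizer().fit_transform(paragraphs).toarray()
    breaches = []
    for i in range(len(paragraphs) - 1):
        # Paired test over the tf-idf vectors of two consecutive paragraphs;
        # a small p-value suggests they stem from different distributions.
        _, p_value = wilcoxon(tfidf[i], tfidf[i + 1])
        if p_value < alpha:
            breaches.append(i)
    return breaches
```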
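The two baselines can be sketched as follows, assuming documents are given as token lists and border positions are reported as token indices (the actual evaluation used character positions):

```python
# Minimal sketch of BASELINE-rnd and BASELINE-eq; token indices are an
# assumption, the real evaluation reported character positions.
import random

def baseline_rnd(tokens):
    """Place between 0 and 10 borders at arbitrary token positions."""
    n = random.randint(0, 10)
    return sorted(random.sample(range(1, len(tokens)), min(n, len(tokens) - 1)))

def baseline_eq(tokens):
    """Choose 0-10 borders randomly, then space them uniformly so that all
    resulting segments contain equally many tokens."""
    n = random.randint(0, 10)
    segment = len(tokens) / (n + 1)
    return [round(segment * i) for i in range(1, n + 1)]
```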
3.2 Data Set
To develop and optimize the respective algorithms, distinct training and test data sets
have been provided, which are based on the Webis-TRC-12 data set [39]. The original
corpus, which already served as the basis for the PAN’16 diarization data set [49],
contains documents on 150 topics used at the TREC Web Tracks from 2009-2011 [7] and
was created by hiring professional writers through crowdsourcing. Each writer was asked
to search for a given topic, including assignments (e.g., “Barack Obama”, assignment:
include information about Obama’s family), and to compose a single document from the
search results. All sources of the resulting documents are annotated, so the
origin of each text fragment is known.
Table 1. Parameters for generating the data sets.

Parameter                          Value(s)
number of style breaches           0-8
number of collaborating authors    1-5
document length                    ≥ 100 words
average segment length             ≥ 50 words
border positions                   (only) at the end of sentences / paragraphs
segment length distribution        equalized / random
Assuming that each distinct source represents a different author in the original data
set, a training and a test data set have been randomly created from these documents by
varying several parameters, as shown in Table 1. Besides the number of style breaches
and collaborating authors, the authorship boundary types have also been varied to lie at
paragraph or sentence level, i.e., authors may switch only at the end of paragraphs
(identified by at least two consecutive line breaks) or also within paragraphs.
Nevertheless, in order not to overcomplicate the task and to build more realistic data
sets, the atomic units were set to be sentences, i.e., borders may not occur within
sentences. The resulting segment lengths were either equalized to be of similar size or
chosen randomly within a document.
As the original corpus has been partly used and published, the test documents have
been created from previously unpublished documents only. Overall, the number of doc-
uments in the training data set is 187, whereas the test data set contains 99 documents.
The final statistics of the generated data sets are presented in Table 2.
3.3 Experimental Setup
Performance Measures The performance of the submitted algorithms has been measured
with two common metrics used in the field of text segmentation. The WindowDiff
metric [35], which was proposed for general text segmentation evaluation, is computed
as it is still widely used for such problems. It calculates an error rate between 0 and 1
for predicting borders (whereby 0 indicates a perfect prediction), penalizing near-misses
less than complete misses or extra borders. Depending on the problem
types and data sets used, text segmentation approaches report near-perfect WindowDiff
values of less than 0.01, while on the other hand the error rate can exceed 0.6
under certain circumstances [14]. As an alternative, a more recent adaptation
of the WindowDiff metric is the WinPR metric [45]. It enhances WindowDiff by computing
the common information retrieval measures precision (WinP) and recall (WinR)
and thus allows a more detailed, qualitative statement about the prediction. Internally,
WinP and WinR are computed from the calculation of true and false positives/negatives,
respectively, also assigning higher values if predicted borders are closer
to the real border position.
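As an illustration of WindowDiff, consider the following minimal sketch; representing boundaries as word positions and setting the window size k to half the average true segment length are the usual conventions, not specifics of the shared task:

```python
# Minimal sketch of WindowDiff [35]; boundaries are word positions, and a
# boundary "after word i" is stored as i.
def window_diff(ref, hyp, n_words, k):
    """Error rate in [0, 1]; 0 indicates a perfect prediction."""
    count = lambda borders, i, j: sum(1 for p in borders if i <= p < j)
    errors = sum(1 for i in range(n_words - k)
                 if count(ref, i, i + k) != count(hyp, i, i + k))
    return errors / (n_words - k)

# A near-miss is penalized only in the few windows where the boundary counts
# differ, whereas a completely missed border is penalized in every window
# spanning it:
print(window_diff({100}, {103}, n_words=400, k=20))  # ~0.016
print(window_diff({100}, set(), n_words=400, k=20))  # ~0.053
```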
Both metrics are computed on a word level, whereby the participants were asked
to provide character positions; the tokenization was thus delegated to the evaluator script.
Table 2. Data set statistics.

                                        Train        Test
#documents                              187          99
#style breaches       0                 36 (19%)     20 (20%)
                      1-3               81 (43%)     44 (44%)
                      4-6               45 (24%)     25 (25%)
                      7-8               25 (13%)     10 (10%)
#authors              1                 36 (19%)     20 (20%)
                      2-3               84 (45%)     44 (44%)
                      4-5               67 (36%)     35 (35%)
document length       < 500             13 (7%)      8 (8%)
(words)               500-1000          42 (22%)     24 (24%)
                      1000-2000         77 (41%)     50 (51%)
                      2000-3000         40 (21%)     13 (13%)
                      3000-4000         10 (5%)      1 (1%)
                      >= 4000           5 (3%)       3 (3%)
average segment       < 100             8 (4%)       3 (3%)
length (words)        100-250           56 (30%)     28 (28%)
                      250-500           61 (33%)     43 (43%)
                      500-1000          48 (26%)     20 (20%)
                      >= 1000           14 (7%)      5 (5%)
border position       sentence          90 (48%)     46 (46%)
                      paragraph         97 (52%)     53 (54%)
segment length        equalized         94 (50%)     55 (56%)
distribution          random            93 (50%)     44 (44%)
For the final ranking of all participating teams, the F-score of WinPR
(WinF), i.e., the harmonic mean of WinP and WinR, is used.
Workflow The participants designed and optimized their approaches with the given,
publicly available training data set described earlier. The performance could be mea-
sured either locally using a provided evaluator script, or by uploading the respective
software to TIRA [16, 38] and running it against the training data set. Because the test
data set was not publicly available, the latter option had to be used for it: the
participants submitted their final software and ran it against the test data without
seeing performance results. It was manually ensured that no potentially helpful
information about the data set was publicly logged during the execution. Finally,
participants were allowed to submit three successful test data runs, whereby the latest
submission is used for the final ranking and for all results presented in Section 3.4.
3.4 Results
The final results of the three submitting teams are shown in Table 3. Considering WinF,
the baseline that equalizes segment sizes was exceeded by only one approach, whereas
Table 3. Style breach detection results. Participants are ranked according to their WinF score.
Rank Participant WinP WinR WinF WindowDiff Runtime
1 Karaś et al. 0.315 0.586 0.323 0.546 00:01:19
– BASELINE-eq 0.337 0.645 0.289 0.647 –
2 Khan 0.399 0.487 0.289 0.480 00:02:23
3 Safin et al. 0.371 0.543 0.277 0.529 00:20:25
– BASELINE-rnd 0.302 0.534 0.236 0.598 –
the baseline using completely random positions was outperformed by all participants.
With respect to WindowDiff, all approaches perform better than both baselines.
Interestingly, besides achieving the best WinF performance, Karaś et al. also needed
the shortest runtime for the prediction, whereas Safin et al. required by far the
longest runtime at over 20 minutes, probably due to the cost-intensive neural
sentence embedding technique [43].
Figure 1 depicts the performance of all approaches, including BASELINE-eq, with
respect to several parameters. Regarding the number of style breaches (a), there is a
significant difference between the approaches when analyzing documents with no author
switches. While the overall winning approach of Karaś et al. performs poorly, Safin et al.
achieve their best score for these documents. The result of the latter may be caused by
their intrinsic plagiarism detection type of approach, which distinguishes between
documents containing suspicious sentences and plagiarism-free documents, i.e., between
documents containing style breaches or not. The other approaches assume style borders
to exist, which also holds for the baseline in over 90% of the cases, as it chooses a
random number of borders between 0 and 10.
While the number of authors (b) seems to have no significant impact on the performances
(except for the distinction between one and more authors, which is already covered by
subfigure (a), where zero style breaches correspond to a single author), the document
length (c) influences the results, especially for very short and very long documents.
The approach of Karaś et al. generally improves with the length of the text, achieving
its best results for documents of 1,000-2,000 words, which constitute the majority.
Khan achieves good results for both short and long documents, while Safin et al. score
well only in the former case. With respect to the average segment length (d), the
performance of the winning approach of Karaś et al. decreases drastically for segment
lengths of over 500 words. Nevertheless, it achieves good results for documents with
shorter segments and, remarkably, the highest score for documents with very short
segment lengths.
Finally, the impact of the border position and the segment length distribution is
shown in subfigures (e) and (f), respectively. For the border position, only Karaś et al.
show a significant improvement when style breaches appear only at paragraph ends.
This also reflects their approach, which treats paragraphs as natural candidate border
positions and, if no paragraphs exist, creates artificial paragraphs from a fixed number
of sentences. Moreover, this may also be the reason why the approach performs better
for segments of similar lengths, as this scenario better matches the creation of
artificial paragraphs.
[Figure 1: six panels plotting WinF for Safin et al., Karaś et al., Khan, and BASELINE-eq
against (a) Number of Style Breaches, (b) Number of Authors, (c) Document Length in Words,
(d) Average Segment Length in Words, (e) Border Position, and (f) Segment Length Distribution.]
Figure 1. Style breach detection results with respect to number of style breaches, number of
authors, document length, average segment length, border position, and segment length distribution.
To highlight the potential of the approaches, their individual best results are listed in
Table 4. The upper part shows the best configurations for single-authored documents,
while the lower part presents the best performances for documents containing style
breaches. Again it can be seen that Karaś et al. assume style breaches to exist and
thus reach very poor results if a document contains no breaches. On the other hand,
Table 4. Best style breach detection results per approach for single-authored documents and
documents containing style breaches.
Participant #breaches #authors doc. len. border seg. distr. WinF WindowDiff
Karaś et al. 0 1 418 sent eq 0.059 0.145
Khan 0 1 337 sent rand 1.000 0.000
Safin et al. 0 1 365 par rand 1.000 0.000
Karaś et al. 1 2 1027 sent eq 0.877 0.082
Khan 1 2 955 par rand 0.806 0.130
Safin et al. 3 4 692 par rand 0.634 0.251
Khan as well as Safin et al. achieve perfect prediction rates, i.e., they correctly estimate
that there are no style borders (not shown in the table: both achieve perfect predictions
for several of the documents containing no style breaches). For documents containing
style breaches, Karaś et al. and Khan reach very good top results with WinF scores of over 0.8.
4 Author Clustering
4.1 Task Definition
Given a collection D of short (paragraph-length) documents, we approach the author
clustering task in two scenarios:
– Complete Clustering. All documents should be assigned to clusters, where each
cluster corresponds to a distinct author. More specifically, each document d ∈ D
should be assigned to exactly one of k clusters, where k is not given.
– Authorship-Link Ranking. Pairs of documents by the same author (authorship-links)
should be extracted. For each authorship-link (di , dj ) ∈ D × D, a confidence
score in [0,1] should be estimated, and the authorship-links are ranked in decreasing
order of confidence.
All documents within a clustering problem are single-authored, written in the same
language, and belong to the same genre; however, topic and text-length may vary. The
main difference with respect to the corresponding PAN-2016 task [50] is that the
documents are short, comprising only a few sentences (paragraph length). This makes the
task harder since text-length is crucial when attempting to extract stylometric information.
4.2 Evaluation Datasets
The datasets used for training and evaluation were extracted from the corresponding
PAN-2016 corpora that include three languages (English, Dutch, and Greek) and two
genres (articles and reviews). Each PAN-2016 text was segmented into paragraphs and
Table 5. The author clustering corpus: average clusteriness ratio (r), number of documents (N),
number of authors (k), number of authorship links, maximum cluster size (maxC), and average
words per document.

          Language  Genre     Problems   r     N     k     Links   maxC   Words
Training  English   articles  10         0.3   20    5.6   57.3    9.2    52.6
          English   reviews   10         0.3   19.4  6.1   45.4    8.2    62.2
          Dutch     articles  10         0.3   20    5.3   61.6    9.8    51.8
          Dutch     reviews   10         0.4   18.2  6.5   19.7    4.0    140.6
          Greek     articles  10         0.3   20    6.0   38.0    6.7    48.2
          Greek     reviews   10         0.3   20    6.1   41.6    7.5    39.4
Test      English   articles  20         0.3   20    5.7   59.3    9.5    52.5
          English   reviews   20         0.3   20    6.4   43.5    7.9    65.3
          Dutch     articles  20         0.3   20    5.7   49.4    8.3    49.3
          Dutch     reviews   20         0.4   18.4  7.1   19.3    4.1    152.0
          Greek     articles  20         0.3   19.9  5.2   59.6    9.6    46.6
          Greek     reviews   20         0.3   20    6.0   42.2    7.6    37.1
all paragraphs with fewer than 100 or more than 500 characters were discarded.
In each clustering problem, documents by the same authors were selected randomly
from all original documents. This means that paragraphs of the same original document
or of other documents (by the same author) may be grouped. Certainly, when paragraphs
come from the same original document, there is much greater thematic similarity.
The only exception in this process was the Dutch reviews corpus, because its texts were
already short (one paragraph each); in this case, the PAN-2017 datasets were built using
the PAN-2016 procedure.
Table 5 shows details about the training and test datasets. Most of the clustering
problems include 20 documents (paragraphs) by an average of 6 authors. Each clustering
problem has an average of about 50 authorship links, and the largest cluster
contains about 8 documents. Each document has an average of about 50 words. Note
that in the case of Dutch reviews these figures deviate from the norm (documents are
longer and authorship links are fewer).
An important factor of each clustering problem is the clusteriness ratio r = k/N ,
where N is the size of D; for example, a problem with N = 20 documents by k = 6
distinct authors has r = 0.3. When r is high, most documents belong to single-item
clusters and there are few authorship links. When r is low, most documents belong to
multi-item clusters and there are plenty of authorship links. Estimating r (since N is
known, k has to be estimated) is crucial for each clustering algorithm. In PAN-2016,
three specific values, r = 0.5, r = 0.7, and r = 0.9, were used, focusing on relatively
high values of r [50]. In the current edition, in both training and test datasets, r
ranges between 0.1 and 0.5, as can be seen in Figure 2. This means that the PAN-2017
corpus has far fewer single-item clusters in comparison to PAN-2016.
4.3 Performance Measures
The same evaluation measures introduced in PAN-2016 are used. As a consequence,
the results are directly comparable to the ones from the corresponding PAN-2016 task.
Figure 2. Distribution of clusteriness ratio r values in the test dataset problems.
In more detail, for the complete clustering scenario, Bcubed recall, Bcubed precision,
and Bcubed F-score are calculated. These are among the best extrinsic clustering
evaluation measures and were found to satisfy several important formal constraints,
including cluster homogeneity and cluster completeness [3]. With respect to the
authorship-link ranking scenario, established measures are used to estimate the ability
of systems to rank correct results highly: Mean Average Precision (MAP), R-precision,
and P@10.
4.4 Baselines
To understand the complexity of the tasks and the effectiveness of participating systems,
we used a set of baseline approaches and applied them to the evaluation datasets. The
baseline methods range from naive to strong and allow us to estimate the weaknesses and
strengths of the participants' approaches. More specifically, the following baseline methods
were used:
– BASELINE-Random. Given a set of documents, the method randomly chooses the
number of authors and randomly assigns each document to one of the authors. It
extracts all authorship links from the produced clusters and assigns a random score
to each one of them. The average performance of this method over 30 repetitions is
reported. This naive approach can only serve as an indication of the lowest perfor-
mance.
– BASELINE-Singleton. This method sets k = N , that is, all documents are assumed to
be by different authors. It forms singleton clusters and no authorship links; as a result,
it is used only for the complete clustering scenario. This simple method was found very
effective on the PAN-2016 datasets, and its performance increases with r [50]. Since
the range of r is lower in the PAN-2017 datasets, its performance should be negatively
affected.
– BASELINE-Cosine. Each document is represented by the normalized frequencies
of all words occurring at least 3 times in the given collection of documents. Then,
for each pair of documents the cosine similarity is calculated and used as the
authorship-link score (a minimal sketch is given after this list). This simple method is
only used in the authorship-link ranking scenario and was found hard to beat in the
PAN-2016 evaluation [50].
– BASELINE-PAN16. This is the top-performing method submitted to the corresponding
PAN-2016 task. It is based on a character-level recurrent neural network
and is a modification of an effective authorship verification approach [4]. There
was no attempt to adapt this method to the PAN-2017 corpus. Given that it follows
a highly conservative approach to forming multi-item clusters (suitable for the
PAN-2016 corpus), its Bcubed recall is expected to be very low on the current corpus.
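A minimal sketch of BASELINE-Cosine, as referenced in the list above; whitespace tokenization and lowercasing are assumptions:

```python
# Minimal sketch of BASELINE-Cosine: normalized frequencies of all words
# occurring at least 3 times in the collection, ranked by cosine similarity.
from collections import Counter
from itertools import combinations
import math

def cosine_baseline(docs):
    """docs: list of strings. Returns ranked (i, j, score) authorship-links."""
    token_lists = [d.lower().split() for d in docs]
    collection_counts = Counter(t for tokens in token_lists for t in tokens)
    vocab = [w for w, c in collection_counts.items() if c >= 3]
    # One normalized frequency vector per document over the pruned vocabulary.
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[w] / len(tokens) for w in vocab])
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0
    links = [(i, j, cosine(vectors[i], vectors[j]))
             for i, j in combinations(range(len(docs)), 2)]
    return sorted(links, key=lambda link: -link[2])
```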
4.5 Survey of Submissions
We received six submissions from research teams in Cuba [11], Germany [19], the
Netherlands [1], Mexico [17], Poland [23], and Switzerland [28]. All participants also
submitted a notebook paper describing their approach.
In general, all submissions follow a bottom-up paradigm where first the pairwise
similarity between any pair of documents is estimated and then this information is used
to form clusters. Gómez-Adorno et al. use hierarchical agglomerative clustering [17],
while García et al. use β-compact graph-based clustering [11]. Kocher & Savoy apply
merging criteria to the pairwise similarities [28]. Alberts [1] proposes a modification
of a similar method submitted to PAN-2016 [27]. Halvani & Graner use the k-medoids
clustering algorithm [19], while Karaś et al. rely on a variation of locality-sensitive
hashing [23].
To calculate the pairwise (dis)similarity between documents in a given collection,
Alberts and Kocher & Savoy propose simple formulas that compare two probability
distributions. García et al. use the Dice index, Gómez-Adorno et al. use cosine similarity,
while Halvani & Graner rely on the Compression-based Cosine measure.
A crucial issue is how to estimate the number of clusters k in a given collection
of documents. A common choice is the Silhouette coefficient, used to indicate the
most suitable number of clusters [19, 23], while Gómez-Adorno et al. use the Calinski-
Harabasz index [17]. Another idea is the use of graph-based clustering methods that can
automatically adapt to a clustering problem [1, 11].
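For illustration, a minimal sketch of Silhouette-based estimation of k follows; the agglomerative clustering and the character 3-gram tf-idf representation are assumptions and do not correspond to any particular submission:

```python
# Minimal sketch of choosing k via the Silhouette coefficient; the clustering
# algorithm and the features are illustrative assumptions.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def estimate_k(docs):
    X = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit_transform(docs)
    X = X.toarray()
    best_k, best_score = 2, -1.0
    # The Silhouette coefficient is undefined for k=1 and k=N, so only the
    # interior range of candidate cluster numbers is scanned.
    for k in range(2, len(docs)):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```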
As concerns the stylometric measures used to quantify the personal style of authors,
most of the submissions are based on low-level character or lexical features such as
word and character n-grams. García et al. was the only team experimenting with
higher-level features requiring NLP tools such as lemmatizers and POS taggers, and only
for the English datasets. Some submissions used a single type of features (e.g., character
n-grams [1, 28]) while others used a pool of different feature types and attempted to select
the most suitable type (or combination of types) for each language and genre [17, 11].
A feature-agnostic, compression-based approach is proposed by Halvani & Graner [19].
Table 6. Overall evaluation results in author clustering (mean values for all clustering problems).
Participants are ranked according to Bcubed F-score.
Participant              Complete clustering             Authorship-link ranking     Runtime
                         B3-F    B3-rec.   B3-prec.      MAP     RP      P@10
Gómez-Adorno et al. 0.573 0.639 0.607 0.456 0.417 0.618 00:02:06
García et al. 0.565 0.518 0.692 0.381 0.376 0.535 00:15:49
Kocher & Savoy 0.552 0.517 0.677 0.396 0.369 0.509 00:00:42
Halvani & Graner 0.549 0.589 0.569 0.139 0.251 0.263 00:12:25
Alberts 0.528 0.599 0.550 0.042 0.089 0.284 00:01:46
BASELINE-PAN16 0.487 0.331 0.987 0.443 0.390 0.583 50:17:49
Karaś et al. 0.466 0.580 0.439 0.125 0.218 0.252 00:00:26
BASELINE-Singleton 0.456 0.304 1.000 – – – –
BASELINE-Random 0.452 0.339 0.731 0.024 0.051 0.209 –
BASELINE-Cosine – – – 0.308 0.294 0.348 –
4.6 Evaluation Results
All participant methods were submitted to the TIRA experimentation platform where
the participants were able to run their software on both training and test datasets [16,
38]. PAN organizers provided feedback in case a run produced errors or unexpected
output. The participants were given the opportunity to perform at most two runs on the
test dataset and were informed about the evaluation results; in all cases, the last run
was considered for the final evaluation.
Table 6 shows the overall evaluation results for both complete clustering and
authorship-link ranking on the entire test dataset. The elapsed runtime of each submis-
sion is also reported. As can be seen, the method of Gómez-Adorno et al. [17] achieves
the best results in both scenarios. In fact, it is the top-performing method according to
all but one evaluation measure, namely Bcubed precision. By definition,
BASELINE-Singleton achieves perfect Bcubed precision since it produces single-item
clusters exclusively. Moreover, BASELINE-PAN16 attempts to optimize precision by
following a very conservative strategy when forming multi-item clusters. Among
the PAN-2017 submissions, the approaches of García et al. [11] and Kocher & Savoy
[28] are the best in terms of Bcubed precision. However, the winning approach of
Gómez-Adorno et al. [17] is the only one that achieves both Bcubed recall and precision
higher than 0.6. As concerns efficiency, almost all submitted approaches are quite fast.
The approaches of García et al. [11] and Halvani & Graner are relatively slower than the
rest of the submissions; however, both are much faster than the very demanding
BASELINE-PAN16.
Table 7 provides a more detailed view of performance (Bcubed F-score) in each
evaluation dataset separately for the complete clustering scenario. All submitted meth-
ods are better than BASELINE-Singleton and BASELINE-Random in overall results.
Actually, the performance of these two baseline methods is quite similar, in contrast
to the results of PAN-2016 [50]. Moreover, all but one submission were better than
BASELINE-PAN16. These observations can be explained by the low values of the
clusteriness ratio (r) in the PAN-2017 datasets. This means that single-item clusters are not
Table 7. Evaluation results (BCubed F-score) per language and genre for the complete clustering
scenario. Participants are ranked according to overall Bcubed F-score.
Participant Overall English English Dutch Dutch Greek Greek
articles reviews articles reviews articles reviews
Gómez-Adorno et al. 0.573 0.618 0.565 0.679 0.474 0.544 0.552
García et al. 0.565 0.567 0.578 0.614 0.603 0.489 0.513
Kocher & Savoy 0.552 0.607 0.570 0.586 0.535 0.511 0.506
Halvani & Graner 0.549 0.534 0.528 0.606 0.519 0.566 0.533
Alberts 0.528 0.523 0.487 0.560 0.536 0.524 0.536
BASELINE-PAN16 0.487 0.477 0.483 0.485 0.570 0.426 0.485
Karaś et al. 0.466 0.508 0.428 0.461 0.474 0.498 0.464
BASELINE-Singleton 0.458 0.436 0.475 0.438 0.543 0.403 0.455
BASELINE-Random 0.452 0.441 0.462 0.437 0.508 0.415 0.450
the majority in the PAN-2017 datasets, and approaches that attempt to optimize precision
over recall are not as effective as in PAN-2016. Note that in the case of Dutch reviews,
where r is higher, BASELINE-Singleton and BASELINE-PAN16 improve.
The approaches of Gómez-Adorno et al. [17], Kocher & Savoy [28], and Halvani
& Graner seem to be more effective on articles than on reviews, while the method
of García et al. [11] is not affected significantly by genre. Moreover, the methods of
García et al. and Kocher & Savoy seem to handle English and Dutch texts better than
Greek texts.
Figure 3. BCubed recall for varying clusteriness ratio values in the test dataset problems.
Figures 3 and 4 show Bcubed recall and precision for varying values of the clusteriness
ratio r over the entire test dataset. As can be seen, Bcubed recall tends to improve
while Bcubed precision decreases as r increases. BASELINE-PAN16 suffers in recall,
which increases almost linearly with r, while it maintains almost perfect precision.
The approach of Gómez-Adorno et al. achieves the most balanced recall and precision
scores, especially for relatively low r values. The rest of the submissions either
Figure 4. BCubed precision for varying clusteriness ratio values in the test dataset problems.
Table 8. Evaluation results (MAP) per language and genre for the authorship-link ranking sce-
nario. Participants are ranked according to overall MAP.
Participant Overall English English Dutch Dutch Greek Greek
articles reviews articles reviews articles reviews
Gómez-Adorno et al. 0.455 0.551 0.491 0.534 0.311 0.482 0.422
BASELINE-PAN16 0.443 0.554 0.463 0.532 0.303 0.505 0.302
Kocher & Savoy 0.395 0.470 0.386 0.440 0.307 0.445 0.384
García et al. 0.380 0.376 0.421 0.432 0.318 0.426 0.366
BASELINE-cosine 0.308 0.388 0.274 0.315 0.211 0.386 0.273
Halvani & Graner 0.139 0.117 0.129 0.152 0.097 0.192 0.145
Karaś et al. 0.125 0.133 0.105 0.138 0.079 0.176 0.148
Alberts 0.042 0.043 0.048 0.049 0.046 0.035 0.029
BASELINE-random 0.024 0.027 0.022 0.023 0.021 0.026 0.023
favor recall (Alberts [1], Halvani & Graner [19], Karaś [23]) or precision (García et al.
[11], Kocher & Savoy [28]).
Table 8 shows the evaluation results (MAP) per language and genre for the
authorship-link ranking scenario. Here, BASELINE-PAN16 is quite competitive, and
only the method of Gómez-Adorno et al. [17] is able to surpass it. Moreover,
BASELINE-Cosine is quite strong and outperforms half of the submissions. Recall that
the winning approach of Gómez-Adorno et al. is also based on cosine similarity, using a
richer set of features and a log-entropy weighting scheme. In general, almost all
submissions achieve their worst results on the Dutch reviews dataset. Recall from Table 5
that this dataset has distinct characteristics: despite containing longer texts than
the rest of the datasets, Dutch reviews form the most difficult case. It seems that
the methods of Gómez-Adorno et al. [17], Kocher & Savoy [28], and Karaś et al. [23]
are better at handling articles than reviews. The same is true for BASELINE-Cosine,
indicating that thematic information is more useful in articles. In the authorship-link
scenario, the language factor seems not to be crucial, since the evaluation results are
balanced over the three examined languages.
Figure 5. MAP for varying clusteriness ratio values in the test dataset problems.
Figure 5 demonstrates how MAP values are affected by the clusteriness ratio. There
are two groups of methods: a strong group that can compete with BASELINE-PAN16,
and a weak group with low results (also lower than BASELINE-Cosine). Clearly, the
method of Gómez-Adorno et al. and BASELINE-PAN16 are better than the methods of
Kocher & Savoy and García et al. over practically the whole range of r. The winning
approach of Gómez-Adorno et al. is better than BASELINE-PAN16 for relatively high
values of r. In addition, the approach of García et al. surpasses the method of Kocher
& Savoy only for high values of r. Recall that when r increases, there are fewer true
authorship-links.
5 Discussion
The author identification task at PAN-2017 focused on unsupervised author analysis
by decomposing text documents into their authorial components. To study different as-
pects in detail, two separate subtasks were addressed: (1) style breach detection, aiming
to segment a document by stylistic characteristics, and (2) author clustering, aiming to
group paragraph-length texts by authorship as well as to assign confidence scores to
pairs of documents written by the same author. For both tasks, comprehensive data sets
have been provided, which allowed participants to train their approaches on the respec-
tive training part prior to evaluating them against the inaccessible test part. Although
the two subtasks do not seem that different to approach, e.g., by computing similar
stylometric fingerprints, the results indicate that intrinsically segmenting a text into
distinct authorial components remains hard to tackle. On the other hand, the gap for
building clusters of already segmented texts could be narrowed, in large part due to the
outcomes of similar studies conducted in previous PAN events.
For the style breach detection subtask, three approaches were submitted, utilizing
common stylometric features and word dictionaries in combination with different
distance metrics, or applying neural sentence embeddings analogous to word embeddings.
Although all approaches achieved better performance than the simple random baseline,
only one of them could exceed a slightly enhanced baseline that is likewise based on
random guesses. Interestingly, this winning approach considers only the ends of
preformatted paragraphs as possible segment borders and, if no paragraphs exist,
creates artificial paragraphs of predefined, fixed lengths. This underlines that there
is still room for significant improvements, e.g., by dynamically adjusting the borders.
Moreover, another approach essentially used an intrinsic plagiarism detection method,
which performs outlier detection over the whole document and marks the outliers as
borders. Tackling the style breach detection task this way is not optimal, since
intrinsic plagiarism detection algorithms assume a main author to exist. Finally,
none of the approaches utilized standard machine learning techniques such as support
vector machines. Considering the findings of recent research using such techniques on
textual data, it can be assumed that, if optimized and used accordingly, the performance
of style breach detection algorithms can be improved significantly.
For the author clustering task, the submitted methods achieved lower Bcubed F-scores
in comparison to the evaluation results of the corresponding task at PAN-2016.
However, this should not be attributed to the fact that text-length in the PAN-2017
datasets is much lower. A more crucial factor is the much lower range of the clusteriness
ratio r, which limits the number of single-item clusters and significantly increases the
number of true authorship-links. As a result, the performance of naive baseline methods,
like BASELINE-Singleton, is not as competitive as in the corresponding task at PAN-2016.
Moreover, MAP scores are considerably higher in comparison to PAN-2016.
Given that the MAP scores of BASELINE-PAN16 are also improved with respect to its
performance on the PAN-2016 datasets, this can again be largely explained by the low
values of the clusteriness ratio. The success of the top-performing submission shows
that very good results can be obtained with well-known clustering methods and similarity
functions, given that a suitable feature set and feature weighting scheme is selected for
each dataset [17].
Bibliography
[1] Alberts, H.: Author clustering with the aid of a simple distance measure. In: Cappellato,
L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop
Proceedings, CLEF and CEUR-WS.org (2017)
[2] Aldebei, K., He, X., Jia, W., Yang, J.: Unsupervised multi-author document decomposition
based on hidden markov model. In: Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics, ACL, Volume 1: Long Papers (2016)
[3] Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering
evaluation metrics based on formal constraints. Information Retrieval 12(4), 461–486
(2009)
[4] Bagnall, D.: Authorship Clustering Using Multi-headed Recurrent Neural Networks. In:
CLEF 2016 Working Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org
(2016)
[5] Blei, D.M., Moreno, P.J.: Topic segmentation with an aspect hidden markov model. In:
Proceedings of the 24th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval. pp. 343–348. SIGIR ’01, ACM, New York, NY,
USA (2001), http://doi.acm.org/10.1145/383952.384021
[6] Choi, F.Y.: Advances in Domain Independent Linear Text Segmentation. In: Proceedings
of the 1st North American chapter of the Association for Computational Linguistics
conference. pp. 26–33. Association for Computational Linguistics (2000)
[7] Clarke, C.L., Craswell, N., Soboroff, I., Voorhees, E.M.: Overview of the TREC 2009 web
track. Tech. rep., DTIC Document (2009)
[8] Daks, A., Clark, A.: Unsupervised authorial clustering based on syntactic structure. In:
Proceedings of the ACL 2016 Student Research Workshop. pp. 114–118. Association for
Computational Linguistics (2016)
[9] Karaś, D., Śpiewak, M., Sobecki, P.: OPI-JSA at CLEF 2017: Author Clustering and Style
Breach Detection. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR
Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2017)
[10] Eisenstein, J., Barzilay, R.: Bayesian Unsupervised Topic Segmentation. In: Proceedings
of the Conference on Empirical Methods in Natural Language Processing. pp. 334–343.
EMNLP ’08 (2008)
[11] García, Y., Castro, D., Lavielle, V., Muñoz, R.: Discovering Author Groups Using a
β-compact Graph-based Clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T.
(eds.) CLEF 2017 Working Notes. CEUR Workshop Proceedings, CLEF and
CEUR-WS.org (2017)
[12] Giannella, C.: An improved algorithm for unsupervised decomposition of a multi-author
document. JASIST 67(2), 400–411 (2016)
[13] Gibbons, J.D., Chakraborti, S.: Nonparametric Statistical Inference, pp. 977–979. Springer
Berlin Heidelberg, Berlin, Heidelberg (2011)
[14] Glavaš, G., Nanni, F., Ponzetto, S.P.: Unsupervised text segmentation using semantic
relatedness graphs. Association for Computational Linguistics (2016)
[15] Glover, A., Hirst, G.: Detecting stylistic inconsistencies in collaborative writing. In: The
New Writing Environment, pp. 147–168. Springer (1996)
[16] Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web
Framework for Providing Experiments as a Service. In: Hersh, B., Callan, J., Maarek, Y.,
Sanderson, M. (eds.) 35th International ACM Conference on Research and Development
in Information Retrieval (SIGIR 12). pp. 1125–1126. ACM (Aug 2012)
[17] Gómez-Adorno, H., Aleman, Y., Vilariño, D., Sanchez-Perez, M.A., Pinto, D., Sidorov, G.:
Author Clustering using Hierarchical Clustering Analysis. In: Cappellato, L., Ferro, N.,
Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop Proceedings,
CLEF and CEUR-WS.org (2017)
[18] Graham, N., Hirst, G., Marthi, B.: Segmenting documents by stylistic character. Natural
Language Engineering 11(04), 397–415 (2005)
[19] Halvani, O., Graner, L.: Author Clustering based on Compression-based Dissimilarity
Scores. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working
Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2017)
[20] Hearst, M.A.: Texttiling: Segmenting text into multi-paragraph subtopic passages.
Computational linguistics 23(1), 33–64 (1997)
[21] Holmes, D.I.: The evolution of stylometry in humanities scholarship. Literary and
Linguistic Computing 13(3), 111–117 (1998)
[22] Iqbal, F., Binsalleeh, H., Fung, B.C.M., Debbabi, M.: Mining writeprints from anonymous
e-mails for forensic investigation. Digital Investigation 7(1-2), 56–64 (2010)
[23] Karaś, D., Śpiewak, M., Sobecki, P.: OPI-JSA at CLEF 2017: Author Clustering and Style
Breach Detection. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017
Working Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2017)
[24] Kestemont, M., Luyckx, K., Daelemans, W.: Intrinsic Plagiarism Detection Using
Character Trigram Distance Scores. In: Notebook Papers of the 5th Evaluation Lab on
Uncovering Plagiarism, Authorship and Social Software Misuse (PAN). Amsterdam, The
Netherlands (September 2011)
[25] Khan, J.A.: Style Breach Detection: An Unsupervised Detection Model. In: Working
Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF
and CEUR-WS.org (Sep 2017)
[26] Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.:
Skip-thought vectors. In: Advances in neural information processing systems. pp.
3294–3302 (2015)
[27] Kocher, M.: UniNE at CLEF 2016: Author Clustering. In: CLEF 2016 Working Notes.
CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2016)
[28] Kocher, M., Savoy, J.: UniNE at CLEF 2017: Author Clustering. In: Cappellato, L., Ferro,
N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop
Proceedings, CLEF and CEUR-WS.org (2017)
[29] Koppel, M., Akiva, N., Dershowitz, I., Dershowitz, N.: Unsupervised decomposition of a
document into authorial components. In: Lin, D., Matsumoto, Y., Mihalcea, R. (eds.)
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.
pp. 1356–1364 (2011)
[30] Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution.
Journal of the American Society for Information Science and Technology 60(1), 9–26
(2009)
[31] Kuznetsov, M., Motrenko, A., Kuznetsova, R., Strijov, V.: Methods for Intrinsic Plagiarism
Detection and Author Diarization. In: Working Notes Papers of the CLEF 2016 Evaluation
Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016)
[32] Layton, R., Watters, P., Dazeley, R.: Automated unsupervised authorship analysis using
evidence accumulation clustering. Natural Language Engineering 19, 95–120 (2013)
[33] Layton, R., Watters, P., Dazeley, R.: Evaluating authorship distance methods using the
positive silhouette coefficient. Natural Language Engineering 19, 517–535 (2013)
[34] Misra, H., Yvon, F., Jose, J.M., Cappe, O.: Text segmentation via topic modeling: an
analytical study. In: Proceedings of the 18th ACM conference on Information and
knowledge management. pp. 1553–1556. ACM (2009)
[35] Pevzner, L., Hearst, M.A.: A critique and improvement of an evaluation metric for text
segmentation. Computational Linguistics 28(1), 19–36 (2002)
[36] Ponte, J.M., Croft, W.B.: Text Segmentation by Topic. In: Research and Advanced
Technology for Digital Libraries, pp. 113–125. Springer (1997)
[37] Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd
International Competition on Plagiarism Detection. In: Notebook Papers of the 5th
Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN).
Amsterdam, The Netherlands (September 2011)
[38] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the
Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and
Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M.,
Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality,
Multimodality, and Visualization. 5th International Conference of the CLEF Initiative
(CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
[39] Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing Interaction Logs to
Understand Text Reuse from the Web. In: Fung, P., Poesio, M. (eds.) Proceedings of the
51st Annual Meeting of the Association for Computational Linguistics (ACL 13). pp.
1212–1221. Association for Computational Linguistics (Aug 2013),
http://www.aclweb.org/anthology/P13-1119
[40] Rangel Pardo, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.:
Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In:
Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings,
CLEF and CEUR-WS.org (Sep 2016)
[41] Reynar, J.C.: Statistical Models for Topic Segmentation. In: Proc. of the 37th annual
meeting of the Association for Computational Linguistics on Computational Linguistics.
pp. 357–364 (1999)
[42] Riedl, M., Biemann, C.: Topictiling: a text segmentation algorithm based on lda. In:
Proceedings of ACL 2012 Student Research Workshop. pp. 37–42. Association for
Computational Linguistics (2012)
[43] Safin, K., Kuznetsova, R.: Style Breach Detection with Neural Sentence Embeddings. In:
Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings,
CLEF and CEUR-WS.org (Sep 2017)
[44] Samdani, R., Chang, K.W., Roth, D.: A discriminative latent variable model for online
clustering. In: Proceedings of The 31st International Conference on Machine Learning. pp.
1–9 (2014)
[45] Scaiano, M., Inkpen, D.: Getting more from segmentation evaluation. In: Proceedings of
the 2012 conference of the north american chapter of the association for computational
linguistics: Human language technologies. pp. 362–366. Association for Computational
Linguistics (2012)
[46] Sittar, A., Iqbal, R., Nawab, A.: Author Diarization Using Cluster-Distance Approach. In:
Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings,
CLEF and CEUR-WS.org (Sep 2016)
[47] Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the
American Society for Information Science and Technology 60, 538–556 (2009)
[48] Stamatatos, E.: Intrinsic Plagiarism Detection Using Character n-gram Profiles. In:
Notebook Papers of the 5th Evaluation Lab on Uncovering Plagiarism, Authorship and
Social Software Misuse (PAN). Amsterdam, The Netherlands (September 2011)
[49] Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B.,
Potthast, M.: Clustering by Authorship Within and Across Documents. In: Working Notes
Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and
CEUR-WS.org (Sep 2016), http://ceur-ws.org/Vol-1609/
[50] Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B.,
Potthast, M.: Clustering by Authorship Within and Across Documents. In: Working Notes
Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and
CEUR-WS.org (Sep 2016)
[51] Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Language Resources
and Evaluation 45(1), 63–82 (Mar 2011)
[52] Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Language Resources
and Evaluation 45(1), 63–82 (2011)
[53] Tschuggnall, M., Specht, G.: Plag-Inn: Intrinsic Plagiarism Detection Using Grammar
Trees. In: Proceedings of the 18th International Conference on Applications of Natural
Language to Information Systems (NLDB). pp. 284–289. Springer, Groningen, The
Netherlands (June 2012)
[54] Tschuggnall, M., Specht, G.: Countering Plagiarism by Exposing Irregularities in Authors’
Grammar. In: Proceedings of the European Intelligence and Security Informatics
Conference (EISIC). pp. 15–22. IEEE, Uppsala, Sweden (August 2013)
[55] Tschuggnall, M., Specht, G.: Using Grammar-Profiles to Intrinsically Expose Plagiarism
in Text Documents. In: Proc. of the 18th Conf. of Natural Language Processing and
Information Systems (NLDB). pp. 297–302 (2013)
[56] Tschuggnall, M., Specht, G.: Automatic decomposition of multi-author documents using
grammar analysis. In: Proceedings of the 26th GI-Workshop on Grundlagen von
Datenbanken. CEUR-WS, Bozen, Italy (October 2014)
[57] Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online
messages: Writing-style features and classification techniques. Journal of the American
Society for Information Science and Technology 57(3), 378–393 (2006)