=Paper=
{{Paper
|id=Vol-1866/paper_59
|storemode=property
|title=Author Clustering based on Compression-based Dissimilarity Scores
|pdfUrl=https://ceur-ws.org/Vol-1866/paper_59.pdf
|volume=Vol-1866
|authors=Oren Halvani,Lukas Graner
|dblpUrl=https://dblp.org/rec/conf/clef/HalvaniG17
}}
==Author Clustering based on Compression-based Dissimilarity Scores==
Author Clustering Using Compression-based
Dissimilarity Scores
Notebook for PAN at CLEF 2017
Oren Halvani? and Lukas Graner
Fraunhofer Institute for Secure Information Technology SIT
Rheinstrasse 75, 64295 Darmstadt, Germany
{FirstName.LastName}@SIT.Fraunhofer.de
Abstract The PAN 2017 Author Clustering task examines the two application
scenarios complete author clustering and authorship-link ranking. In the first sce-
nario, one must identify the number (k) of different authors within a document
collection and assign each document to exactly one of the k clusters, where each
cluster corresponds to a different author. In the second scenario, one must esta-
blish authorship links between documents in a cluster and provide a list of docu-
ment pairs, ranked according to a confidence score. We present a simple scheme
to handle both scenarios. In order to group the documents by their authors, we
use k-Medoids, where the optimal k is determined through the computation of
silhouettes. To determine links between the documents in each cluster, we ap-
ply a predefined compressor as well as a dissimilarity measure. The resulting
compression-based dissimilarity scores are then used to rank all document pairs.
The proposed scheme does not require (text-)preprocessing, feature engineering
or hyperparameter optimization, which are often necessary in author clustering
and/or other related fields. However, the achieved results indicate that there is
room for improvement.
1 Introduction
Author clustering (AC) is a relatively new sub-discipline in the field of authorship ana-
lysis and is offered again by PAN [10] this year as a shared task1 . Given a collection of
documents, the goal of AC is to group documents written by the same author, such that
each cluster corresponds to a different author [15]. Formally, the AC problem can be
defined as follows: Given a set of n documents D = {D1 , D2 , . . . , Dn } the task is to
form a clustering C = {C1 , C2 , . . . Ck } regarding D such that each cluster C comprises
documents {Da , Db , Dc , . . .} written by the same author A ∈ A, where A denotes a set
of k different authors.
The PAN 2017 Author Clustering task examines two application scenarios: complete
author clustering and authorship-link ranking. In the first scenario, one must identify k
?
Corresponding author.
1
A shared task is an event, where researchers and practitioners aim to solve or at least make
progress on open academic problems.
(the number of different authors within D) while assigning each D ∈ D exactly to one
cluster C ∈ C = {C1 , C2 , . . . Ck }. In the second scenario, one must establish authorship
links between the documents {Da , Db , Dc , . . .} in each cluster C and provide a list of
document pairs (Da , Db ), ranked according to a confidence score λ ∈ [0; 1], where λ
indicates how likely Da and Db are to be written by the same author.
We present a simple AC approach based on the k-Medoids algorithm and the com-
putation of so-called silhouettes to determine the optimal k. Instead of using distances
computed through well-known metrics such as Manhattan or Euclid, we decided to
experiment with compression-based dissimilarity scores. To compute these scores we
apply a compression-based model consisting of a predefined compressor and a dissi-
milarity measure designed for compressed text files. Compression-based models have
been applied widely across different authorship analysis tasks including authorship at-
tribution [5,9] or authorship verification [2,4,16], as well as in other related disciplines
such as text classification [3,8,12] and have been shown to be highly effective compared
to state-of-the-art approaches, not only in terms of recognition rates but also in terms
of runtime. In [4, Table 4] for example, the authors have shown that their compression-
based authorship verification method performed very similar to the winning approach
[1] of the PAN 2015 Author Identification task [14], where it only required 7 seconds
instead of 21 hours.
Our approach has a number of benefits. First, it does not require the explicit definition,
selection and/or extraction of features as these are implicitly handled by the compres-
sion model. Second, our approach does not rely on a threshold which is often mandatory
to judge whether two documents are written by the same author. Third, our approach
does not involve machine learning methods and, thus, also not requires hyperparame-
ter optimization (which is typically needed for classification/recognition). Fourth, the
approach does not even need a specific preprocessing regarding the documents, which
further reduces its complexity.
2 Our approach
This section describes our approach, which is broken down into both scenarios complete
author clustering and authorship-link ranking.
2.1 Task 1: Complete author clustering
Compressing distances: As mentioned in Section 1 we waive the usage of a traditional
distance function and instead use a compression-based dissimilarity measure. Given this
measure, we can determine the "nearness" between two documents. However, before we
can use this measure we require a compressor to obtain the compressed representation
of the documents. Here, we decided to use one of the most powerful available compres-
sor PPM2 (Prediction by Partial Matching), which has been used excessively in various
fields and domains and led to promising results. Once the documents are compressed
via PPM, we apply a dissimilarity function to measure how (dis-)similar two documents
are to each other. As a dissimilarity function we chose the CBC (Compression-based
Cosine) measure, proposed by Sculley and Brodley [13], which is defined as:
C(x) + C(y) − C(xy)
CBC(x, y) = 1 − p . (1)
C(x)C(y)
Here, x and y denote two documents, and xy their concatenation. With C(·) we de-
note the length of a compressed document, which aims to approximate its Kolmogorov
Complexity. The resulting value is in the interval [0; 1].
Clustering via k-Medoids: In the PAN 2017 Author Clustering task the simplification
is taken that all documents are single-authored. In practice this is not very realistic as
it can often occur that documents (or text fragments such as paragraphs, sentences or
phrases) are authored by different authors. However, we take advantage of the fact that
all documents within the PAN corpora are single-authored and chose a simple parti-
tional clustering algorithm that generates disjoint clusters. As a clustering algorithm
we decided to use k-Medoids (proposed by Kaufman and Rousseeuw [6]), which is
strongly related to the well-known k-Means method. However, in k-Medoids each clus-
ter is represented by one of the objects in the cluster (the medoid), while in k-Means
each cluster is represented by the center of the cluster (the mean).
The most common realization of the k-Medoids clustering method is the PAM (Par-
titioning Around Medoids, [7]) algorithm, which we we slightly modified by using a
compression-based dissimilarity measure rather than a distance function. The modified
algorithm is given in Algorithm 1.
Measure the quality of the clustering. Since for each problem ρ the number of authors k
is not known beforehand, a strategy is needed to measure the clustering quality, in order
to determine the "optimal" k. Our strategy is based on the computation of silhouettes
(proposed by Rousseeuw [11]). The idea is to perform n (= number of documents in a
problem ρ) clustering iterations3 which results in n − 1 clusterings C2 , C3 , Cn via k-
Medoids and to pick the k for which the clustering Ck yields the maximum silhouette
coefficient SC , defined as:
1 X
SC = s(D)
nC
D∈C
Here, The calculation of a silhouette value s(D) is calculated as follows:
2
In fact we use the PPMd variant, implemented in the C# library SharpCompress, offered by
Adam Hathcock available under https://github.com/adamhathcock/sharpcompress. As a con-
crete implementation we used Michael Bone’s port of Dmitry Shkarin’s PPMd Variant I Revi-
sion 1.
3
Note that we skip the case n = 1, as we assume that for each problem ρ there are two or more
corresponding authors.
Algorithm 1: k-Medoids, adapted to compression-based dissimilarity scores.
Input: Number of clusters: k; document collection D = {D1 , D2 , . . . , Dn }; dissimilarity
measure: d(x, y)
Output: A clustering comprising k clusters: C = {C1 , C2 , . . . , Ck }
/* 0.) Helper methods: */
/* A dissimilarity score between two documents, weighted by the
sum of all dissimilarity scores between the first document and
each other document within the collection: */
d(D1 , D2 )
wd(D1 , D2 ) = P 0
;
D 0 ∈D d(D1 , D )
/* The sum of weighted dissimilarity scores regarding a document:
*/
X
w(D) = wd(D0 , D);
D 0 ∈D
/* The sum of the minimum dissimilarity scores between each
non-medoid n and a medoid m. Here, N refers to the set of all
non-medoids, while M denotes the set of all medoids. */
X
totalCost(N, M ) = min (d(n, m));
m∈M
n∈N
/* 1.) Initialize medoids: */
Dsorted ← D sorted ascending by w(·);
M ← first k elements of Dsorted ;
/* 2.) Minimize total cost by finding a more suitable medoid at
each step. Repeat until cost cannot be further decreased. */
label minimizeStep:
foreach m ∈ M do
foreach n ∈ D \ M do
M 0 ← M \ {m} ∪ {n};
if totalCost(D \ M 0 , M 0 ) < totalCost(D \ M, M ) then
M ← M 0;
goto minimizeStep;
/* 3.) Assign non-medoids to their nearest medoids to form
clusters. */
C ← ∅;
foreach mi ∈ M do
Ci ← {D | D ∈ D ∧ mi = arg min(d(D, m))};
m∈M
C ← C ∪ {Ci };
return C = {C1 , C2 , . . . , Ck };
1. Let s(D) ∈ [−1; 1] denote a silhouette value for a document D ∈ D, which was
assigned to a cluster Ca . We first compute a(D) = the average dissimilarity of D to
all other documents in the same cluster Ca .
2. For every other cluster C = 6 Ca , we calculate the average dissimilarity b(d) =
dist(D, C) between D and each document in C. The cluster with the smallest average
dissimilarity to D is denoted by Cb .
3. Finally, we compute s(D) as follows: For the case that the initial cluster comprises
only one document (|Ca | = 1) or that a = b holds, we set s(D) = 0. For the case
a(D)
that a(D) < b(D) we calculate s(D) = 1 − b(D) and otherwise s(D) = b(D) a(D) − 1.
2.2 Task 2: Authorship-link ranking
In order to establish authorship-links within each cluster, we first modified the CBC
measure in order to calculate similarity (instead of dissimilarity) scores as follows:
C(x) + C(y) − C(xy)
CBCsim (x, y) = p . (2)
C(x)C(y)
Given CBCsim (x, y), we applied it on each document pair within a cluster and sorted
the resulting list in a descending order. Note that the authorship-link ranking step could
be also performed through an arbitrary authorship verification method. However, we
tried to keep the approach as compact as possible. Therefore, we only made use of
PPM to compress the documents and calculate their similarity to each other by using
CBCsim (·).
3 Evaluation
Since our approach does not require any type of training, there was no need to split the
given training corpus into two sub-sets in order to apply hyperparameter learning on
one set and the evaluation on the second set. Besides the PAN 2017 AC training corpus
we also used the training corpus from PAN 20164 . The results regarding both corpora
are listed in Tables 1-6.
3.1 PPM: Optional parametrization
As stated in this papers, our scheme does not require any type of training. However,
this is only true, because we used a predefined (hard coded) parametrization regarding
the PPM compressor within the involved C# library. In fact, there are two tweakable
parameters (AllocatorSize and ModelOrder) that aim to improve the compres-
sion results. For AllocatorSize, we could not observe any influence regarding the
author clustering results, irrespective of which values were used. Therefore, we waived
4
Note that at the time this paper was written, the test corpus was not publicly released.
Table 1. PAN 2017 Author Clustering Training Dataset 2017-02-15 [English]
Problem Language Genre F-Bcubed R-Bcubed P-Bcubed Av-Precision
problem001 en articles 0,47481 0,49444 0,45667 0,11564
problem002 en articles 0,48596 0,52333 0,45357 0,065881
problem003 en articles 0,4979 0,50606 0,49 0,041557
problem004 en articles 0,6467 0,875 0,51288 0,25755
problem005 en articles 0,42331 0,75 0,29487 0,047852
problem006 en articles 0,44589 0,525 0,3875 0,095395
problem007 en articles 0,56866 0,475 0,70833 0,16332
problem008 en articles 0,53429 0,39444 0,82778 0,20633
problem009 en articles 0,54495 0,48611 0,62 0,093589
problem010 en articles 0,48862 0,34286 0,85 0,099382
problem011 en reviews 0,63504 0,55192 0,74762 0,25705
problem012 en reviews 0,52277 0,4 0,75429 0,137
problem013 en reviews 0,47764 0,34619 0,77 0,033695
problem014 en reviews 0,54136 0,57778 0,50926 0,044189
problem015 en reviews 0,51064 0,34286 1 0,12121
problem016 en reviews 0,7094 0,71048 0,70833 0,30976
problem017 en reviews 0,72956 0,85333 0,63714 0,33655
problem018 en reviews 0,60968 0,54 0,7 0,073084
problem019 en reviews 0,52027 0,62619 0,445 0,063617
problem020 en reviews 0,51891 0,60667 0,45333 0,023188
Average 0,544318 0,546383 0,6163285 0,12929195
to train an "optimal" value for this parameter and, instead, used the default setting of
224 = 16, 777, 216.
In contrast, we observed for ModelOrder slight variations regarding the author cluste-
ring results, during initial experiments. Hence, we applied our scheme on both training
datasets (PAN 2016 and PAN 2017), in order to consider, if it make sense to discard trai-
ning and, instead, to use the default parameter setting of 6 (in total there are 15 possible
values, ranging from 2 to 16). As can be inferred from the results (given in Figure 1),
the default parameter setting is very close to the average across all possible parameter
settings. As a consequence, we decided to discard the training for this parameter and to
use the default (hard-coded) setting.
3.2 Other experiments
Besides k-Medoids we also experimented with the density-based clustering method
DBSCAN (Density-Based Spatial Clustering of Applications with Noise.), where we
also used compression-based dissimilarity scores rather than distances. Our intention
was to eliminate the determination of k, not only to reduce the approach’s complexity,
but also to save runtime as only one scan through the documents is needed. However,
instead of the expected reduction it added more complexity as both density parameters ε
(maximum radius of the neighborhood) and minPts (minimum number of points requi-
red to form a dense region) require training. In addition, it turned out that after training
Table 2. PAN 2017 Author Clustering Training Dataset 2017-02-15 [Dutch]
Problem Language Genre F-Bcubed R-Bcubed P-Bcubed Av-Precision
problem021 nl articles 0,54377 0,55333 0,53452 0,093142
problem022 nl articles 0,56078 0,91 0,40526 0,28237
problem023 nl articles 0,44267 0,565 0,36389 0,020261
problem024 nl articles 0,5565 0,39351 0,95 0,2424
problem025 nl articles 0,65494 0,59722 0,725 0,15419
problem026 nl articles 0,43757 0,33333 0,63667 0,079443
problem027 nl articles 0,68961 0,56905 0,875 0,26102
problem028 nl articles 0,68785 0,79 0,60909 0,16974
problem029 nl articles 0,59828 0,48654 0,77667 0,16066
problem030 nl articles 0,5784 0,41905 0,93333 0,19865
problem031 nl reviews 0,6 0,5 0,75 0,12189
problem032 nl reviews 0,51471 0,4375 0,625 0,03189
problem033 nl reviews 0,46684 0,52778 0,41852 0,071347
problem034 nl reviews 0,64865 0,75 0,57143 0,18214
problem035 nl reviews 0,5916 0,55 0,64 0,12552
problem036 nl reviews 0,54637 0,47222 0,64815 0,031937
problem037 nl reviews 0,66009 0,675 0,64583 0,14429
problem038 nl reviews 0,43555 0,61667 0,33667 0,075441
problem039 nl reviews 0,40594 0,5 0,34167 0,013461
problem040 nl reviews 0,49321 0,74167 0,36944 0,076857
Average 0,5556665 0,5693935 0,607807 0,12683245
DBSCAN still performed worse than k-Medoids on both training corpora PAN-2016
and PAN-2017. On average, DBSCAN achieved only 80% of k-Medoids’ F-Bcubed
scores. Therefore, we discarded this approach.
4 Conclusions
We proposed an experimental approach to cluster texts by their authors by using k-
Medoids with compression-based dissimilarity scores. On the plus side, our approach
is quite simple and entirely independent from feature engineering, threshold determi-
nation (regarding the authorship-link ranking sub-task), (text-) preprocessing as well
as hyperparameter optimization. On the negative side, the proposed approach does not
perform very well, which might have a number of reasons. We noticed for example (af-
ter the submission deadline of the software) that the compression-based dissimilarity
measure does not fulfill even one of the required properties of a real distance-based me-
tric, which are identity5 , symmetry6 and triangle inequality. Especially the symmetry
5
For example, when we compress a document x and apply CBC(x, x) we obtain as a dissimi-
larity measure the score : 0.117647. This value is somehow confusing as we might expect 0
when we are used to work with real distance metrics.
6
For example, consider we have two different documents x and y. Computing CBC(x, y) re-
turns 0.6459, while CBC(y, x) returns 0.6852.
Table 3. PAN 2017 Author Clustering Training Dataset 2017-02-15 [Greek]
Problem Language Genre F-Bcubed R-Bcubed P-Bcubed Av-Precision
problem041 gr articles 0,42798 0,55667 0,34762 0,016841
problem042 gr articles 0,49535 0,57083 0,4375 0,051153
problem043 gr articles 0,52746 0,695 0,425 0,062329
problem044 gr articles 0,47622 0,58833 0,4 0,047589
problem045 gr articles 0,42076 0,43 0,4119 0,021957
problem046 gr articles 0,34142 0,415 0,29 0,02013
problem047 gr articles 0,65524 0,725 0,59773 0,16782
problem048 gr articles 0,40556 0,485 0,34848 0,049585
problem049 gr articles 0,44287 0,34 0,635 0,063383
problem050 gr articles 0,47897 0,40833 0,57917 0,10167
problem051 gr reviews 0,48119 0,5375 0,43556 0,07551
problem052 gr reviews 0,41693 0,93333 0,26842 0,16368
problem053 gr reviews 0,49617 0,3975 0,66 0,164
problem054 gr reviews 0,46805 0,40889 0,54722 0,061924
problem055 gr reviews 0,59742 0,72778 0,50667 0,23549
problem056 gr reviews 0,53797 0,635 0,46667 0,075371
problem057 gr reviews 0,49493 0,79167 0,36 0,097521
problem058 gr reviews 0,67832 0,6125 0,76 0,19889
problem059 gr reviews 0,80721 0,93333 0,71111 0,59345
problem060 gr reviews 0,62222 0,7 0,56 0,13118
Average 0,513612 0,594583 0,4874025 0,11997365
Table 4. PAN 2016 Author Clustering Training Dataset 2016-02-17 [English]
Problem Language Genre F-Bcubed R-Bcubed P-Bcubed Av-Precision
problem001 en articles 0,30386 0,82133 0,18641 0,009715
problem002 en articles 0,42318 0,64302 0,31537 0,019366
problem003 en articles 0,28383 0,96 0,16653 0,0090597
problem004 en reviews 0,20851 0,76667 0,12067 0,0028351
problem005 en reviews 0,17956 0,94583 0,099198 0,0039141
problem006 en reviews 0,30265 0,67875 0,19474 0,010734
Average 0,283598333 0,8026 0,180486333 0,00927065
Table 5. PAN 2016 Author Clustering Training Dataset 2016-02-17 [Dutch]
Problem Language Genre F-Bcubed R-Bcubed P-Bcubed Av-Precision
problem007 nl articles 0,36444 0,90643 0,22807 0,0014286
problem008 nl articles 0,60859 0,62765 0,59064 0,042869
problem009 nl articles 0,3755 0,80117 0,24522 0,013505
problem010 nl reviews 0,3779 0,64833 0,26667 0,0087443
problem011 nl reviews 0,25545 0,72 0,15527 0,0017606
problem012 nl reviews 0,30055 0,91 0,18 0
Average 0,380405 0,76893 0,277645 0,011384583
Table 6. PAN 2016 Author Clustering Training Dataset 2016-02-17 [Greek]
Problem Language Genre F-Bcubed R-Bcubed P-Bcubed Av-Precision
problem013 gr articles 0,29726 0,7 0,1887 0,017252
problem014 gr articles 0,26916 0,88 0,15888 0,024111
problem015 gr articles 0,21535 0,93939 0,12162 0,010769
problem016 gr reviews 0,22652 0,93939 0,12879 0,0012533
problem017 gr reviews 0,41019 0,86818 0,26853 0,033574
problem018 gr reviews 0,34012 0,92727 0,20825 0,022682
Average 0,2931 0,875705 0,179128333 0,01827355
0.542
16 0.342
0.543
15 0.349
0.542
14 0.351
0.544
13 0.346
0.543
12 0.343
0.552
11
PPMd Model Order
0.344
0.551
10 0.344
0.545
9 0.337
0.552
8 0.337
0.535
7 0.330
0.545
6 0.329
0.547
5 0.308
0.543
4 0.302
0.539
3 0.270
0.529
2 0.207
Avg. 0.323 Avg. 0.543
0 0.2 0.4 0.6
F-Bcubed score
Pan 16 Corpus
Pan 17 Corpus
Figure 1. Author clustering results for the 15 different ModelOrder parameter settings.
property leads to an unexpected behavior, due to the implication that the order of the
compressed documents matters when applying the compression-based dissimilarity on
them. As future work we therefore need to examine for which cases compression-based
models are applicable. Currently, we believe that they are well suited for establishing
authorship-link rankings, but for clustering alternative strategies might be more promi-
sing (and reliable).
Acknowledgments This work was supported by the German Federal Ministry of Edu-
cation and Research (BMBF) in the funded project EWV (award number: 13N13500).
References
1. Bagnall, D.: Author Identification Using Multi-headed Recurrent Neural Networks. In:
Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse,
France, September 8-11, 2015. [14], http://ceur-ws.org/Vol-1391/150-CR.pdf
2. Cerra, D., Datcu, M., Reinartz, P.: Authorship Analysis Based on Data Compression.
Pattern Recognition Letters 42, 79 – 84 (2014),
http://www.sciencedirect.com/science/article/pii/S0167865514000336
3. Coutinho, D.P., Figueiredo, M.A.T.: Text Classification Using Compression-Based
Dissimilarity Measures. IJPRAI 29(5) (2015),
http://dx.doi.org/10.1142/S0218001415530043
4. Halvani, O., Winter, C., Graner, L.: Authorship Verification based on Compression-Models.
ArXiv e-prints (Jun 2017)
5. Jr., W.O., Justino, E., Oliveira, L.: Comparing Compression Models for Authorship
Attribution. Forensic Science International 228(1âĂŞ3), 100–104 (2013),
http://www.sciencedirect.com/science/article/pii/S0379073813000923
6. Kaufman, L., Rousseeuw, P.J.: Clustering by Means of Medoids. Statistical Data Analysis
Based on the L1-Norm and Related Methods pp. 405–416 (1987)
7. Kaufman, L., Rousseeuw, P.J.: Partitioning around Medoids (Program PAM). Finding
Groups in Data: An Introduction to Cluster Analysis pp. 68–125 (1990)
8. Marton, Y., Wu, N., Hellerstein, L.: On Compression-Based Text Classification. In:
Advances in Information Retrieval, 27th European Conference on IR Research, ECIR 2005,
Santiago de Compostela, Spain, March 21-23, 2005, Proceedings. pp. 300–314 (2005),
http://dx.doi.org/10.1007/978-3-540-31865-1_22
9. Nagaprasad, S., Reddy, P.V., Babu, A.V.: Authorship Attribution based on Data
Compression for Telugu Text. International Journal of Computer Applications 110(1), 1–5
(2015)
10. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of
PAN’17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G.,
Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.)
Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International
Conference of the CLEF Initiative (CLEF 17). Springer, Berlin Heidelberg New York (Sep
2017)
11. Rousseeuw, P.: Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster
Analysis. J. Comput. Appl. Math. 20(1), 53–65 (Nov 1987),
http://dx.doi.org/10.1016/0377-0427(87)90125-7
12. Saikrishna, V., Dowe, D.L., Ray, S.: Statistical Compression-based Models for Text
Classification. In: 2016 Fifth International Conference on Eco-friendly Computing and
Communication Systems (ICECCS). pp. 1–6 (Dec 2016)
13. Sculley, D., Brodley, C.E.: Compression and Machine Learning: A New Perspective on
Feature Space Vectors. In: 2006 Data Compression Conference (DCC 2006), 28-30 March
2006, Snowbird, UT, USA. pp. 332–332. IEEE Computer Society (2006),
http://dx.doi.org/10.1109/DCC.2006.13
14. Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M.,
Stein, B.: Overview of the Author Identification Task at PAN 2015. In: CLEF 2015
Evaluation Labs and Workshop – Working Notes Papers. CEUR, CEUR, Toulouse, France
(2015/09/10 2015)
15. Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B.,
Potthast, M.: In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes
Papers of the CLEF 2017 Evaluation Labs
16. Veenman, C.J., Li, Z.: Authorship Verification with Compression Features. In: Forner, P.,
Navigli, R., Tufis, D., Ferro, N. (eds.) Working Notes for CLEF 2013 Conference , Valencia,
Spain, September 23–26, 2013. CEUR Workshop Proceedings, vol. 1179. CEUR-WS.org
(2013), http://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-VeenmanEt2013.pdf