UniNE at PAN-CLEF 2021: Authorship Verification
(Notebook for PAN at CLEF 2021)

Catherine Ikae1
1
    University of Neuchâtel, Switzerland, Avenue du 1er-Mars 26, 2000 Neuchâtel, Switzerland


                                         Abstract
                                         This work proposes to solve the Open-set author verification problem using a Term Frequency Inverse
                                         Document Frequency (TF−IDF) model with a majority-voting ensemble that incorporates five compo-
                                         nent models (machine-learning classifiers). The task is to verify if a given pair of text is written by the
                                         same or different authors. The training sample contains verification cases from previously unseen au-
                                         thors and topics. Transforming this question into a similarity problem, we can determine whether one
                                         or two authors have written a given text pair. Evaluation with 800 unigram features shows an overall
                                         performance of AUC = 0.9041, c@1 = 0.7586, F1−score = 0.8145, F_0.5u = 0.7233, Brier =0.8247, leading
                                         to an overall score = 0.8050.

                                         Keywords
                                         Author verification, Ensemble classifier, TF−IDF, Open-set author verification


1. Introduction
The increase in the volume of online text in communication, blogging, messaging, commen-
taries and entertainment content has generated the need for verification and authentication of
authorship of the corresponding message. This is crucial in application areas such as analysis
of anonymous emails for forensic investigations [1], verification of historical literature [2]
continuous authentication used in cybersecurity [3], detection of changes in writing styles with
Alzheimer patients [4].
Authorship verification is the application of linguistic style learning to detect whether two or
more texts have been written by the same person or by more than one person [5]. By using
prior information from the training dataset, we model the style representing the same author
text as well as different author text used to construct a classifier that can be used to classify
previously unseen text.
In open-set verification, the true author could be absent from the training set. Thus the system
cannot generate a stylistic representation for each distinct author. So, the main question to be
solved is to determine the level of similarity between two stylistic representations to reach the
decision that this pair of texts has been written by the same author.
As the decision must be based on the author style, one can consider extracting stylistic features
from each text. To achieve this, linguistic features reflecting the style must be extracted from
the training dataset. By applying these selected features to the test dataset, the representation

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" catherine.ikae@unine.ch (C. Ikae)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
of each pair of text is possible. In a second step, a classifier must compute a degree of similarity
upon which the final decision can be taken.
In this work, we use the term frequency–inverse document frequency (TF−IDF) to determine
useful features to discriminate between distinct authors [6]. For this purpose, we create a model
using an ensemble of five machine-learning classifiers. By using this method, we take into
account all vocabulary from all texts extracted by n-grams (of words or letters) weighted by
TF−IDF, and using only a small fraction of them to perform the classification, we determine
the optimal performance of the classifier.
The rest of this paper is organized as follows. Section 2 describes the text datasets while Section
3 describes the features used for the classification. Section 4 explains the similarity measure
and Section 5 depicts some of our evaluation results. A conclusion draws the main findings of
our experiments.


2. Corpus
The corpus consists of data obtained from fanfiction.net, a sharing platform for fanfiction that
comes from various topical domains (or ‘fandoms’) [7] [8]. The contents are mainly fictional
texts produced by non-professional authors in the tradition of a specific cultural domain (or
‘fandom’), such as a famous author or a specific influential work. Fanfiction is now abundantly
available on the internet, as the fastest growing form of online writing providing a platform for
data collection. This corpus contains 52,590 text pairs (denoted problems) from which 27,823
pairs correspond to the same author and 24,767 are pairs written by two distinct persons. Each
text excerpt contains, in mean, 2,200 word-tokens.
Based on the training sample of the entire corpus, Figure 1 depicts the top 25 most frequently
used words.
To quantify the differences and similarities that occur when considering same author text pairs
and different author pairs we use the technique of shift graphs. In shift graphs, words are sorted
by their absolute contribution to the difference between text pairs. Word shifts quantify how
each word contributes to the difference between two text pairs [9].
Figure 2 shows the relative occurrence frequency difference between tokens occurring in the
same author pairs. In this graph, the words appear in decreasing order of their occurrence
frequency. As one can see, there are only two tokens with a large difference in this text namely
the two pronouns I, and she. Otherwise the rest of the tokens appear with small differences.
Figure 3 represents the same information as Figure 2 but with a pair of messages written by
two distinct authors. In this case, one can observe that several tokens present large frequency
differences (e.g., lola, joseph, said, tone, with, hikaru, normal). The presence of such numerous
large differences must be interpreted as evidence of the presence of more than one author. For
this reason, we use a difference vector to encode the data.


3. Feature Selection
To determine whether two text chunks have been written by the same author, we need to
determine a text representation that can characterize the stylistic idiosyncrasies of each possible
Figure 1: Word frequency distribution in the corpus Sample


author. Various text surrogates have been suggested, some focusing more on stylistic aspects,
other on semantics (text vectorization).
As a simple and fast solution, and knowing that we are working with 52,590 text pairs, we will
focus on the word uni-gram. In addition, each of them must have a weight computed according
to the frequency (TF) which measures how frequently a term occurs in a document and the
inverse document frequency (IDF) reflecting how important a term is compared to the entire
corpus [10] [11].
TF−IDF is a statistical measure used in information retrieval and text mining that quantifies
the importance of a word in a document by evaluating how relevant a word is to a document in
Figure 2: Same Author Pairs                  Figure 3: Different Author Pairs


a collection of documents [10] [12]. This is done by multiplying two metrics: how many times
a word appears in a document, and the inverse document frequency of the word across a set of
documents.
It works by increasing proportionally to the number of times a word appears in a document,
but is offset by the number of documents that contain the word. So, words that are common in
every document, such as “this”, “what”, and “if”, rank low even though they may appear many
times, since they don’t mean much to that document in particular.
Applying our mathematical notation, the TF-IDF score for the word t in the document d from
the document set (corpus) is calculated as follows:
tf(t,d) = number of occurrences of t in d/ number of tokens in d
df(t) = number of documents in which t occurs
D = Number of documents in the corpus
                                          𝑖𝑑𝑓 (𝑡) = 𝑙𝑜𝑔(𝐷/(𝑑𝑓 (𝑡) + 1))
                        𝑡𝑓 − 𝑖𝑑𝑓 (𝑡, 𝑑) = 𝑡𝑓 (𝑡, 𝑑) * 𝑙𝑜𝑔(𝐷/(𝑑𝑓 (𝑡) + 1))
                          𝑇 𝐹 − 𝐼𝐷𝐹 = 𝑡𝑓 (𝑡, 𝑑) * 𝑙𝑜𝑔(𝐷/(𝑑𝑓 (𝑡) + 1))

An n-gram is a sequence of n-words in a sentence. Here, n is an integer which stands for the
number of words in the sequence. For example, if we put n=1, then it is referred to as a uni-gram.
For our vectorization we apply the uni-gram of TF−IDF for term weighting. Then, based on
the weight associated with each term, one can apply a feature extraction by selecting the top k
words having the largest TF−IDF value.


4. Ensemble Classifier
Ensemble learning could improve the effectiveness of isolated machine learning systems by
combining several models. Such a combined approach should produce better predictive perfor-
mance compared to a single model. In this view, democracy is viewed as a better system than
the tyranny of a single classifier [13].
Our Ensemble model trains different classifiers including:
   1. Linear Discriminant Analysis (LDA) finds a linear combination of features that separates
      two or more classes of objects in order to classify them [14];
   2. Gradient Boosting (GB) which modifies weak learners to propose a strong learner [15];
   3. Extra Trees (EF) is an ensemble learning technique which aggregates the results of multiple
      de-correlated decision trees constructed from the original training sample to obtain its
      classification result [16];
   4. Support Vector Machine (SVM) determines the best decision boundary between vectors
      that belong to a given group and vectors that do not belong to it dividing the space into
      two subspaces [2];
   5. Stochastic gradient descent (SGD) optimises an objective function equipped with the
      parameters of a model and updates parameters for each training sample [17];
   Finally, we integrate these classifiers into an ensemble predictor to leverage complementary
information of the feature representation method encoded by TF−IDF and classifiers. We used
the scikit-learn Python machine learning library that provides an implementation of stacking
for machine learning [6].


5. Evaluation
To conduct experiments with our approach to distinguish between same author pairs and
different author pairs, we used a sample of the provided small training data set split as follows:
10,000 for training and 4,000 for testing. Each of the partitions used is balanced with an equal
number of same author pairs and different author pairs.
   As a performance measure, five evaluation indicators have been used. The area under the
curve (AUC) which measures the ability of systems to assign higher scores to positive cases in
comparison to negative cases. F1−score combines precision and recall into a unique value. c@1
measures the accuracy of binary predictions but also the ability of systems to leave difficult
cases unanswered [1]. F_0.5u a measure that puts more emphasis on deciding same−author
cases correctly [18]. Brier a score used for evaluating the goodness of (binary) probabilistic
classifiers.
   The proposed method is validated by comparing the AUC values of 15 classifiers. Based on
the generated output, an ensemble is made by combining classifiers with consistent high AUC
values. As seen on Table 1 increasing the unigram TF IDF values from k = 100 to 1000, we see
consistent good performance in Linear Discriminant Analysis (LDA), Gradient Boosting (GB),
Extra Trees (EF), Support Vector Machine (SVM), Stochastic gradient descent (SGD).
   These chosen classifiers are combined using the hard voting (majority voting), every individual
classifier votes for a class, and the majority determines the predicted class. With k=800, this is
the point where most of the classifiers are at their maximum AUC values. The chosen classifiers
had at least an AUC value of 0.87.


Figure 4: Distribution of AUC values for the two classes, same or distinct authors (k = 800)


  With the results obtained from the Ensemble classifier, we view the distribution of AUC
results for the two classes, namely “same author” and “different authors”, As one can see in
Figure 4, “same author” distribution presents a higher similarity mean (mean: 0.64, sd: 0.19)
Table 1
Evaluation based on different feature sizes
                                          Number of TF−IDF unigram features
       Classifiers        100    200     300  400    500   600    700    800      900    1000
       LDA                0.85 0.87 0.88 0.89 0.89 0.88 0.88 0.89                 0.88   0.87
       GradientBoost      0.84 0.86 0.87 0.87 0.87 0.88 0.87 0.88                 0.88   0.88
       ExtraTrees         0.84 0.85 0.86 0.86 0.86 0.86 0.86 0.87                 0.86   0.86
       KNN                0.76 0.76 0.77 0.76 0.74 0.73 0.73 0.70                 0.70   0.68
       GaussianNB         0.82 0.82 0.84 0.83 0.82 0.82 0.81 0.81                 0.79   0.77
       MultinomialNB 0.67 0.71 0.73 0.74 0.74 0.73 0.73 0.74                      0.72   0.72
       BernoulliNB        0.54 0.66 0.72 0.74 0.76 0.75 0.76 0.76                 0.77   0.78
       DecisionTree       0.63 0.64 0.64 0.64 0.63 0.63 0.64 0.64                 0.64   0.63
       RandomForest       0.83 0.85 0.86 0.86 0.87 0.87 0.87 0.87                 0.87   0.87
       LogisticReg        0.84 0.85 0.87 0.87 0.87 0.87 0.87 0.87                 0.86   0.86
       AdaBoost           0.82 0.83 0.85 0.85 0.85 0.85 0.85 0.84                 0.85   0.84
       Bagging            0.78 0.78 0.78 0.79 0.78 0.78 0.77 0.78                 0.78   0.76
       SGD                0.84 0.85 0.87 0.87 0.87 0.87 0.87 0.87                 0.86   0.86
       XGB                0.84 0.86 0.87 0.88 0.87 0.87 0.88 0.88                 0.87   0.87
       SVM                0.85 0.88 0.89 0.89 0.89 0.89 0.89 0.89                 0.88   0.88
       Ensemble           0.85 0.87 0.89 0.89 0.90 0.90 0.90 0.90                 0.90   0.89

Table 2
Official Evaluation with (k = 800)
                                                    800 TF−IDF unigram features
     Classifiers                      AUC      c@1      F1−score F_0.5u        Brier     overall
     Ensemble (Early Bird)            0.904    0.71       0.769     0.685     0.821       0.778
     Ensemble (Final )               0.9041   0.7586      0.8145   0.7233     0.8247     0.8050


representing mainly the higher values while the “different authors” distribution (mean: 0.32, sd:
0.18) mainly contains the lower values.
   The final evaluation result is obtained on the TIRA platform [19] is exposed in Table 2. These
results were obtained with a model trained on the 52,590 text pairs (small training set) and tested
on 19,999 text pairs (official test set). By considering all results as computed by the proposed
system, we achieve an overall score of 0.778 in the early bird submission without mapping any
score to 0.5. The second run was made by adjusting some score to 0.5 based on the analysis of
the similarity distribution. The values greater than 0.4 but less than 0.6 ( 0.4 < x < 0.6) where
equated to 0.5 leading to an improved overall score of 0.8050.


6. Conclusion
This report has presented the proposed solution for open-set author verification at PAN 2021.
Our approach is based on modeling the fandom pairs using word unigram TF−IDF features
with a majority-voting ensemble that incorporates five machine-learning classifiers. With the
ensemble classifier, we achieved an overall score of 0.8050. This simple approach proves to be
effective in distinguishing text written by the same author and text written by different authors.
For future work, the idea is to include longer word n-gram models to enrich the current feature
and to hopefully boost performance of the current technique.


References
 [1] F. Iqbal, L. A. Khan, B. Fung, M. Debbabi, E-mail authorship verification for forensic
     investigation, 2010, pp. 1591–1598. doi:10.1145/1774088.1774428.
 [2] C. Cortes, V. Vapnik, Support vector networks, Machine Learning 20 (1995) 273–297.
 [3] T. Neal, K. Sundararajan, D. Woodard, Exploiting linguistic style as a cognitive biometric
     for continuous verification, in: 2018 International Conference on Biometrics (ICB), 2018,
     pp. 270–276. doi:10.1109/ICB2018.2018.00048.
 [4] G. Hirst, V. Feng, Changes in style in authors with alzheimer’s disease, English Studies 93
     (2012) 357 – 370.
 [5] M. Koppel, J. Schler, S. Argamon, Y. Winter, The “fundamental problem” of authorship
     attribution, English Studies 93 (2012) 284 – 291.
 [6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
     M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
     Learning Research 12 (2011) 2825–2830.
 [7] J. Bevendorff, B. Ghanem, A. Giachanou, M. Kestemont, E. Manjavacas, I. Markov, M. May-
     erl, M. Potthast, F. Rangel, P. Rosso, G. Specht, E. Stamatatos, B. Stein, M. Wiegmann,
     E. Zangerle, Overview of pan 2020: Authorship verification, celebrity profiling, profiling
     fake news spreaders on twitter, and style change detection, in: T. T. S. V. H. J. C. L. C. E. A.
     N. L. C. N. F. Avi Arampatzis, Evangelos Kanoulas (Ed.), 11th International Conference of
     the CLEF Association (CLEF 2020), Springer, 2020. URL: http://ceur-ws.org/Vol-2696/.
 [8] M. Kestemont, I. Markov, E. Stamatatos, E. Manjavacas, J. Bevendorff, M. Potthast, B. Stein,
     Overview of the Authorship Verification Task at PAN 2021, in: CLEF 2021 Labs and
     Workshops, Notebook Papers, CEUR-WS.org, 2021.
 [9] R. J. Gallagher, M. Frank, L. Mitchell, A. J. Schwartz, A. J. Reagan, C. Danforth, P. Dodds,
     Generalized word shift graphs: a method for visualizing and explaining pairwise compar-
     isons between texts, EPJ Data Science 10 (2021) 1–29.
[10] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process.
     Manag. 24 (1988) 513–523.
[11] S. Qaiser, R. Ali, Text mining: Use of tf-idf to examine the relevance of words to documents,
     International Journal of Computer Applications 181 (2018) 25–29.
[12] M. Kestemont, J. Stover, M. Koppel, F. Karsdorp, W. Daelemans, Authenticating the writings
     of julius caesar, Expert Syst. Appl. 63 (2016) 86–96.
[13] M. Bramer, Ensemble Classification, Springer London, London, 2013, pp. 209–220. URL:
     https://doi.org/10.1007/978-1-4471-4884-5_14. doi:10.1007/978-1-4471-4884-5_14.
[14] P. Xanthopoulos, P. M. Pardalos, T. B. Trafalis, Linear Discriminant Analysis, Springer New
     York, New York, NY, 2013, pp. 27–33. URL: https://doi.org/10.1007/978-1-4419-9878-1_4.
     doi:10.1007/978-1-4419-9878-1_4.
[15] R. E. Schapire, The strength of weak learnability, in: Machine Learning, 1990.
[16] P. Geurts, Extremely randomized trees, in: MACHINE LEARNING, 2003, p. 2006.
[17] L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning,
     SIAM Review 60 (2018) 223–311. URL: https://doi.org/10.1137/16M1080173. doi:10.1137/
     16M1080173. arXiv:https://doi.org/10.1137/16M1080173.
[18] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in:
     Proceedings of the 2019 Conference of the North American Chapter of the Association for
     Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
     Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 654–
     659. URL: https://www.aclweb.org/anthology/N19-1068. doi:10.18653/v1/N19-1068.
[19] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
     2019, pp. 123–160. doi:10.1007/978-3-030-22948-1\_5.
[20] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov,
     M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wol-
     ska, , E. Zangerle, Overview of PAN 2021: Authorship Verification,Profiling Hate Speech
     Spreaders on Twitter,and Style Change Detection, in: 12th International Conference of
     the CLEF Association (CLEF 2021), Springer, 2021.
[21] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech
     Spreaders on Twitter Task at PAN 2021, in: A. J. M. M. F. P. Guglielmo Faggioli, Nicola Ferro
     (Ed.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[22] F. Rangel, M. Franco-Salvador, P. Rosso, A Low Dimensionality Representation for Lan-
     guage Variety Identification, in: International Conference on Intelligent Text Processing
     and Computational Linguistics, Springer, 2016, pp. 156–169.