Different Encoding Approaches for Authorship Verification

Stefanos Konstantinou, Jinqiao Li and Angelos Zinonos
University of Zurich, Rämistrasse 71, 8006 Zürich, Switzerland
stefanos.konstantinou@uzh.ch (S. Konstantinou); jinqiao.li@uzh.ch (J. Li); angelos.zinonos@uzh.ch (A. Zinonos)
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy

Abstract
PAN is a series of scientific events and shared tasks focusing on digital text forensics and stylometry. Previous editions of PAN addressed the effectiveness of authorship verification technology across several languages and text genres; this year the task shifted to cross-discourse-type pairs of texts. The purpose of this paper is to test various Transformer-based encoder models using Cross-Encoder and Bi-Encoder approaches. The results illustrate a decent performance, reaching an F1 score of 80% with the best model. Further experimentation on the training dataset did not yield a positive outcome.

Keywords
NLP, Authorship Verification, PAN22, Pre-trained model, Text information, Classification

1. Introduction

This paper presents our approach for the Authorship Verification Shared Task [1] at PAN 2022 [2]. The goal of the task is to decide whether two texts have been written by the same author by comparing their writing styles. Compared to previous editions, this year's aim is to focus on more challenging scenarios, which allows studying the ability of stylometric approaches to capture authorial characteristics even across different discourse types. Successfully discriminating between documents by stylometric means could bring significant benefits to cyber security and criminology. For example, people anonymously producing hate speech on social platforms could be identified from texts as different as their business emails if the task is solved successfully, and thus be held accountable.

The dataset contains essays, emails, text messages and business memos in English. The purpose of this task is to develop a method that compares a pair of texts of different discourse types and predicts whether they are written by the same author. After analyzing related work, we decided to follow a transformer-based approach, since transformers are widely and successfully used in Natural Language Processing. Our goal is to experiment with a wider variety of transformer models than has been tested before.

2. Related Work

[3] gives an overview of the approaches from last year's edition of the PAN CLEF authorship verification task. Successful methods and their maximum performance are presented for a task and evaluation metric identical to this year's, with differences in the context of the dataset. [4] concludes that the best performance is achieved by combining heterogeneous methods, for instance, applying machine learning techniques such as decision trees and artificial neural networks. [5] analyzes the problem of correlating an author's characteristics with the attributes of documents written by that author. They highlight that the first step should be identifying the essential features in the text in order to conduct a better analysis.
Once the relevant features are extracted from a document, different methods can be experimented with to identify the author. This was useful for their implementation of SVMs and Random Forests. Transformer-based architectures have been tried on earlier versions of this task, but were limited to BERT. In [6], a pre-trained BERT model was presented as a solution for encoding the text information of text pairs. Data-record splitting was introduced to create short texts that BERT can encode, achieving the highest c@1 and F1-score on PAN Authorship Verification datasets.

3. Material and Methods

3.1. An Overview of the Dataset

The provided dataset contains 12,264 pairs, of which 10,424 (85%) are used for training and the rest for validation. The dataset comes with several peculiarities. Correlating authors is challenging because the texts come from different writing scenarios. For example, most people follow a formal style when writing business memos. As a result, it is more difficult to find stylistic similarities between e-mails or business memos and text messages or essays, even when a pair is written by the same author.

When using transformers, the common practice is to encode the dataset in its original form without much preprocessing. After comparing emails and text messages, we decided to remove HTML character artifacts, as they do not contribute to authorship verification from a stylometric point of view. Since the RoBERTa model uses byte-pair encoding, we suspected that HTML characters would artificially increase a pair's dissimilarity.

3.1.1. Distribution of Text Length

First, an exploratory data analysis of the given dataset was conducted. The overall text length distribution at word level and the distribution of text lengths per discourse type are shown in Table 1 and Figure 1. They illustrate the necessity of assigning a high maximum length for tokenization, including padding and truncation. The box plot shows that the length of the different discourse types varies greatly: the 'essay' type is much longer than the others, and the 'text message' type is the shortest.

        Overall  Essay  Text_Message  Email  Memo
mean        410   1718            96   1718   220
min          31    240            63    240    31
25%          96   1217            87   1217   169
50%         289   1603            93   1603   212
75%         363   2254           100   2254   292
max        3270   3270           474   3270   416

Table 1: Length of the different types of text (in words).

[Figure 1: Box plot of the text length distribution of the different text types (essay, text_message, email, memo). Granularity of counting is at the word level; logarithmic scaling was applied to the text-length axis.]

3.1.2. Statistics for Pairs

To better understand the dataset, an analysis was performed at the level of discourse type (DT) combinations, and the mean similarity was calculated for each of them (see Table 2). Here, texts are encoded with the S-BERT model; a sketch of this analysis is given after Table 2.

Discourse Type Pair (DT)   # Pairs  # Identical Authors  Mean Similarity
essay, email                  1618                  809           0.4850
email, text_message           7484                 3742           0.6127
essay, text_message           1182                  591           0.4803
memo, email                   1014                  507           0.4952
memo, text_message             780                  390           0.5459
essay, memo                    186                   93           0.4161

Table 2: Analysis of each DT combination. 'Mean Similarity' is the mean cosine similarity over all pairs in each DT combination.
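The analysis behind Table 2 can be reproduced along the following lines. This is a minimal sketch using the sentence-transformers library; the concrete S-BERT checkpoint (here 'all-MiniLM-L6-v2') and the structure of the pair records are assumptions, as they are not specified above.

```python
# Sketch: mean cosine similarity per discourse-type combination (cf. Table 2).
# Assumptions: pairs are dicts with two texts and their discourse types;
# "all-MiniLM-L6-v2" stands in for the unspecified S-BERT checkpoint.
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util

def mean_similarity_per_dt(pairs):
    """pairs: iterable of dicts such as
    {"texts": [t1, t2], "discourse_types": ["essay", "email"]}."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sims = defaultdict(list)
    for pair in pairs:
        t1, t2 = pair["texts"]
        dt_key = tuple(sorted(pair["discourse_types"]))
        emb1, emb2 = model.encode([t1, t2], convert_to_tensor=True)
        sims[dt_key].append(util.cos_sim(emb1, emb2).item())
    return {dt: sum(values) / len(values) for dt, values in sims.items()}
```

Grouping by the sorted discourse-type tuple ensures that, for example, (essay, email) and (email, essay) pairs fall into the same bucket.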
3.2. Encoder Experiments

The aim of our submission is to test pre-trained models of different architectures, fine-tuned for this task on the provided dataset. For text comparison tasks such as semantic similarity or authorship verification, two approaches are popular: the Bi-Encoder and the Cross-Encoder. These architectures are considered to be particularly powerful for similarity tasks.

MPNet [7] is trained with permuted language modeling and claims to gain a better understanding of bidirectional contexts, which may prove crucial for a text similarity task. The RoBERTa models [8] that we used were already fine-tuned on similarity tasks, enabling the transfer of text-similarity knowledge to our task; we therefore expect shorter training times and better performance. The pre-trained model configurations are shown in Table 3. The maximum length of the Bi-Encoder models is 256 subtokens, half that of the Cross-Encoders, because Bi-Encoders encode the two texts separately: the context is limited to two 256-subtoken sequences, whose representations are then concatenated. Note that for all experiments, only pre-trained base models with a maximum input length of 512 subtokens were used.

Encoding Approach   Model                           Max. Subtokens  Epochs  Batch size
Cross-Encoder       BertForNextSentencePrediction              512       8          16
Cross-Encoder       Roberta-Muppet                             512       5           6
Cross-Encoder       Roberta-stsb                               512       5           6
Bi-Encoder          MPNet                                      256       5          12
Bi-Encoder          Roberta-Muppet                             256       5          12

Table 3: Details of the tested pre-trained models. 'Max. Subtokens' is the value of the tokenizer hyperparameter 'max_length'. 'Batch size' is the batch size used during training.

3.2.1. Cross-Encoder Approach

In the Cross-Encoder architecture, both texts are fed through a transformer network simultaneously, and a single joint encoding of the two texts is used for classification (see Figure 2, right). Cross-Encoders are normally used when there is a pre-defined set of text pairs to score. They usually outperform Bi-Encoders but do not scale well to very large datasets; since our task provides exactly such a fixed set of pairs, Cross-Encoders seem suitable for it. For the Cross-Encoder approach, we used several RoBERTa models: Roberta-Muppet, RoBERTa fine-tuned on the STS benchmark [9], and plain BERT trained with the Next Sentence Prediction task [10].

3.2.2. Bi-Encoder Approach

The Bi-Encoder architecture creates a twin network that processes the two texts in parallel in the same way [11], with all parameters shared. A pooling layer creates a fixed-size representation for input sequences of varying length, while extracting the features considered most important (see Figure 2, left). For the Bi-Encoder approach, we experiment with Roberta-Muppet and MPNet, adding a pooling and a dense layer after the standard encoding step.

3.2.3. Cross-Encoder vs. Bi-Encoder

The main difference between the two architectures is that in a Cross-Encoder the two texts are concatenated using the SEP special token as a separator, so the encoder can attend to information from both texts. In the Bi-Encoder architecture, the two texts are encoded separately and their embeddings are concatenated afterwards; no attention is computed between the subtokens of the two texts, as the embedding process is done separately. A sketch of both fine-tuning setups is given below.

[Figure 2: Schematic diagram of the structure of our two encoder approaches [11], illustrated with a BERT model.]
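The following sketch illustrates how the two architectures can be fine-tuned with the sentence-transformers library [11]. It is not our exact training setup: the checkpoint names, the contrastive loss for the Bi-Encoder and the warm-up steps are illustrative assumptions, while maximum lengths, epochs and batch sizes roughly follow Table 3.

```python
# Sketch of the two fine-tuning setups, assuming the sentence-transformers library.
# Checkpoints, loss and warm-up are illustrative; lengths/epochs/batches cf. Table 3.
from torch import nn
from torch.utils.data import DataLoader
from sentence_transformers import (CrossEncoder, InputExample,
                                   SentenceTransformer, losses, models)

# pairs: list of (text1, text2, label) with label 1 = same author, 0 = different.
def train_cross_encoder(pairs, model_name="roberta-base"):
    examples = [InputExample(texts=[t1, t2], label=float(y)) for t1, t2, y in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=6)
    ce = CrossEncoder(model_name, num_labels=1, max_length=512)
    ce.fit(train_dataloader=loader, epochs=5, warmup_steps=100)
    return ce

def train_bi_encoder(pairs, model_name="microsoft/mpnet-base"):
    word_emb = models.Transformer(model_name, max_seq_length=256)
    pooling = models.Pooling(word_emb.get_word_embedding_dimension())
    dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(),
                         out_features=256, activation_function=nn.Tanh())
    bi = SentenceTransformer(modules=[word_emb, pooling, dense])
    examples = [InputExample(texts=[t1, t2], label=float(y)) for t1, t2, y in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=12)
    # Contrastive loss pulls same-author pairs together and pushes others apart.
    bi.fit(train_objectives=[(loader, losses.ContrastiveLoss(bi))],
           epochs=5, warmup_steps=100)
    return bi
```

At inference time, the Cross-Encoder scores a pair directly with ce.predict([(t1, t2)]), while the Bi-Encoder encodes each text once and compares the resulting embeddings with cosine similarity.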
3.3. Dataset Manipulation Experiments

We also experimented with modifying and manipulating the dataset, in particular by splitting the text segments, by negative sampling, and by combining the two techniques; both manipulations are sketched at the end of this section. The purpose is to explore whether these manipulations improve the prediction results.

3.3.1. Splitting Text Segments

Splitting each text segment in half results in twice as many data items as before. Since the models see smaller contexts during training, we hope that more but shorter texts force the model to better learn authorial characteristics that remain stable across discourse types.

3.3.2. Negative Sampling

Negative sampling was initially used to accelerate the training of Skip-Gram models [12] and has since been widely used in natural language processing. It serves two purposes, efficiency and effectiveness:

• Efficiency: negative sampling reduces the training load by optimizing only the vectors involved in computing the cost.
• Effectiveness: negative sampling provides high-quality negative examples in a targeted manner, which both speeds up convergence and allows the model to be optimized in the desired direction.

In this experiment, all authors were mapped to their corresponding texts, and new negative pairs were created by combining texts from different authors. Our aim was to create a dataset with 20% (6,132) positive samples and 80% (approximately 24,000) negative samples.
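Both manipulations can be summarized as follows. This is a minimal sketch under the assumption that each record holds the two texts, the two author IDs and a same-author label; the field names, the pairing of halves in the split and the random sampling of negatives are illustrative choices rather than the exact procedure used.

```python
# Sketch of the two dataset manipulations (Sections 3.3.1 and 3.3.2).
# Assumed record format: {"texts": [t1, t2], "authors": [a1, a2], "same": bool};
# field names and the exact sampling procedure are illustrative.
import random

def split_pairs(records):
    """Split every text in half, doubling the number of training pairs."""
    out = []
    for r in records:
        t1, t2 = r["texts"]
        h1a, h1b = t1[: len(t1) // 2], t1[len(t1) // 2 :]
        h2a, h2b = t2[: len(t2) // 2], t2[len(t2) // 2 :]
        out.append({**r, "texts": [h1a, h2a]})
        out.append({**r, "texts": [h1b, h2b]})
    return out

def add_negative_samples(records, neg_ratio=4, seed=0):
    """Keep the positive pairs and add neg_ratio negatives per positive,
    built from texts written by two different authors."""
    rng = random.Random(seed)
    by_author = {}
    for r in records:
        for text, author in zip(r["texts"], r["authors"]):
            by_author.setdefault(author, []).append(text)
    positives = [r for r in records if r["same"]]
    authors = list(by_author)
    negatives = []
    while len(negatives) < neg_ratio * len(positives):
        a1, a2 = rng.sample(authors, 2)
        negatives.append({"texts": [rng.choice(by_author[a1]), rng.choice(by_author[a2])],
                          "authors": [a1, a2], "same": False})
    return positives + negatives
```

With neg_ratio=4, the resulting set matches the targeted 20%/80% split of positive and negative samples.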
4. Evaluation

To evaluate the proposed models, we use the TIRA evaluation tool [13] with the following metrics:

AUC: the area under the ROC curve.
F1-score: the harmonic mean of precision and recall.
c@1: a variant of the F1-score that rewards systems which leave difficult problems unanswered by returning a score of exactly 0.5.
F0.5u: a measure that puts more emphasis on deciding same-author cases correctly.
Brier: the complement of the Brier score, which evaluates the goodness of binary probabilistic classifiers.

The results of the trained models are compared against a set of baseline results. The baseline is a simple method that computes the cosine similarity between TF-IDF-normalized bag-of-character-tetragram representations of the two texts in a pair. The resulting scores are then shifted using a simple grid search to arrive at an optimal performance on the validation set; a simplified sketch is given below. Note that the same validation set is used for all models.
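A simplified reconstruction of such a baseline with scikit-learn is sketched below. It is not the official PAN baseline script, and the grid search over score shifts is reduced here to choosing a single decision threshold on the validation pairs.

```python
# Sketch of the character-tetragram cosine-similarity baseline described above.
# Illustrative reconstruction (not the official PAN baseline); the grid search
# is reduced to choosing one decision threshold on the validation pairs.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

def baseline_scores(train_texts, pairs):
    """pairs: list of (text1, text2); returns one cosine similarity per pair."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(4, 4))  # character tetragrams
    vec.fit(train_texts)
    scores = []
    for t1, t2 in pairs:
        a, b = vec.transform([t1, t2]).toarray()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(float(a @ b / denom) if denom else 0.5)
    return np.array(scores)

def best_threshold(scores, labels):
    """Grid-search a threshold that maximizes F1 on the validation set."""
    grid = np.linspace(0.05, 0.95, 19)
    return max(grid, key=lambda t: f1_score(labels, scores >= t))
```

The threshold returned by best_threshold then maps the raw similarities to same-author decisions on the validation set.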
5. Results and Discussion

Table 4 shows the results of the trained models. Overall, the models performed decently on the validation set, with overall scores mostly above 70% and higher than the baseline. MPNet with the Bi-Encoder approach scored highest; its permuted language modeling turned out to work better than the masked language modeling used in the pre-training of the other models. The two Roberta-Muppet models performed similarly, falling right behind MPNet, while clearly outperforming the remaining two models. A possible explanation is that Roberta-Muppet's multitask pre-training translated into a better stylometric understanding due to its more generalized embeddings.

Model                              auc    c@1    F0.5u  F1     brier  Overall
Voting Ensemble (3 models)         0.765  0.759  0.718  0.800  0.759  0.760
Bi-MPNet                           0.777  0.771  0.729  0.807  0.771  0.771
Bi-Roberta-Muppet                  0.748  0.743  0.708  0.781  0.743  0.745
CE-Roberta-Muppet                  0.749  0.744  0.709  0.782  0.744  0.746
CE-Roberta-stsb                    0.680  0.672  0.650  0.745  0.672  0.684
CE-BertForNextSentencePrediction   0.705  0.701  0.679  0.724  0.701  0.702
Baseline                           0.550  0.500  0.546  0.671  0.749  0.603
Bi-MPNet (on TIRA)                 0.577  0.557  0.563  0.581  0.589  0.573

Table 4: Results of each model on the validation set and the best model's performance on the test set. The 'Bi-' prefix denotes the Bi-Encoder architecture and the 'CE-' prefix the Cross-Encoder architecture. The voting ensemble combines Bi-MPNet, Bi-Roberta-Muppet and CE-Roberta-Muppet. Among the validation results, Bi-MPNet achieves the best score on every metric. Bi-MPNet (on TIRA) was evaluated on the TIRA test set.

Figure 3 analyzes the predictions of MPNet by accuracy for each combination of discourse types on the validation set. The accuracies lie around 75% for all combinations, with only marginal differences between them. Together with the F0.5u results, this indicates that, up to a certain level, the model managed to learn authorial characteristics across discourse types.

[Figure 3: Percentage of correct predictions for each discourse type combination on the validation set. The per-combination accuracies range from 71.6% to 78.8%.]

Further experimentation is reported in Table 5. The models chosen for this analysis were the three best-performing ones from Table 4. The results show that the dataset manipulation experiments did not contribute positively to the stylometric learning of the models. We observe a drop in the performance of MPNet; splitting the data may have made its permuted language modeling less effective, since less information is encoded for each shortened text. Negative sampling reduced the performance of all models. This could be attributed to the positive-to-negative ratio of the resulting dataset, so a smaller share of negative samples might have worked better. Finally, the Cross-Encoder model with Roberta-Muppet is clearly better than the other two models in the dataset manipulation experiments.

Model               Variant  auc    c@1    F0.5u  F1     brier  Overall
Bi-MPNet            SPLIT    0.689  0.689  0.674  0.739  0.689  0.696
Bi-MPNet            NEG      0.499  0.499  0.555  0.666  0.499  0.544
Bi-MPNet            S.-N.    0.501  0.501  0.556  0.667  0.501  0.545
Bi-Roberta-Muppet   SPLIT    0.640  0.640  0.637  0.691  0.640  0.650
Bi-Roberta-Muppet   NEG      0.501  0.501  0.556  0.667  0.501  0.545
Bi-Roberta-Muppet   S.-N.    0.672  0.672  0.658  0.742  0.672  0.684
CE-Roberta-Muppet   SPLIT    0.729  0.729  0.705  0.770  0.729  0.732
CE-Roberta-Muppet   NEG      0.696  0.696  0.676  0.756  0.696  0.704
CE-Roberta-Muppet   S.-N.    0.590  0.590  0.603  0.699  0.590  0.615

Table 5: Results of each model on the validation set for the three experimental variants: 'SPLIT' splits each text segment in half (doubling the number of data items), 'NEG' is negative sampling, and 'S.-N.' combines SPLIT and NEG. The 'Bi-' prefix denotes the Bi-Encoder approach and the 'CE-' prefix the Cross-Encoder approach. CE-Roberta-Muppet with SPLIT achieves the highest score on every metric.

6. Conclusion

In our submission, we tested various pre-trained encoder models using Cross-Encoder and Bi-Encoder architectures to solve the authorship verification problem of PAN@CLEF 2022. We conclude that the inherent difficulty of the dataset is the major obstacle, because the different discourse types call for different linguistic expressions, as illustrated by the considerable dissimilarity between business memos and text messages. The results show that the simple approach of selecting a pre-trained model and fine-tuning it is able to grasp some stylometric information useful for authorship verification, but overall the performance is not strong enough to reliably solve this difficult task.

Acknowledgments

Special thanks to Dr. Simon Clematide and Andrianos Michail for all the help and guidance they gave our team in completing this work. Special thanks also to the PAN organizers for their support whenever we ran into technical difficulties.

References

[1] E. Stamatatos, M. Kestemont, K. Kredens, P. Pezik, A. Heini, J. Bevendorff, M. Potthast, B. Stein, Overview of the Authorship Verification Task at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR Workshop Proceedings, 2022.
[2] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl, R. Ortega-Bueno, P. Pezik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection, in: A. Barron-Cedeno, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science, Springer, 2022.
[3] M. Kestemont, E. Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, B. Stein, M. Potthast, Overview of the cross-domain authorship verification task at PAN 2021, in: CLEF (Working Notes), 2021, pp. 1743–1759. URL: http://ceur-ws.org/Vol-2936/paper-147.pdf.
[4] E. Stamatatos, Authorship verification: A review of recent advances, Research in Computing Science 123 (2016) 9–25. doi:10.13053/rcs-123-1-1.
[5] P. Juola, Authorship attribution, Foundations and Trends in Information Retrieval 1 (2008) 233–334. doi:10.1561/1500000005.
[6] Z. Peng, L. Kong, Z. Zhang, Z. Han, X. Sun, Encoding text information by pre-trained model for authorship verification, in: CLEF, 2021.
[7] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MPNet: Masked and permuted pre-training for language understanding, arXiv preprint arXiv:2004.09297 (2020).
[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[9] P. May, Machine translated multilingual STS benchmark dataset, 2021. URL: https://github.com/PhilipMay/stsb-multi-mt.
[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805.
[11] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[13] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.