
Neural Features Combined Deep Bayesian Classifier for Authorship Verification

Yitao Sun

ysun@pindrop.com

Svetlana Afanaseva

safanaseva@pindrop.com

Kailash Patil

kpatil@pindrop.com

CEUR-WS

2023

This paper describes our deep learning approach to the PAN 2023 Cross-Discourse Type Authorship Verification Task [1]. We present a hierarchical fusion of two well-established approaches into a single end-to-end learning process: a deep metric learning framework at the top aims to learn a pseudo-metric that maps a variable-length document to a fixed-length feature vector. A separate extraction layer then extracts stylometric features from the document. Finally, a Bayesian probabilistic layer scores the concatenated features to predict the similarity of the documents.

Keywords: deep learning, authorship verification, stylometric, machine learning, natural language processing, NLP

https://www.linkedin.com/in/yitao-s-146015104/ (Y. Sun)


1. Introduction

Authorship verification (pairwise) involves determining whether two documents were authored by the same individual. Traditionally, linguists have undertaken authorship verification to ascertain the authorship of anonymous texts by examining specific linguistic features. These features encompass a range of elements, such as errors (e.g. spelling mistakes), peculiarities in the text (e.g. grammatical inconsistencies), and patterns of writing style [2].

Automated systems, particularly those based on machine learning, have heavily depended on stylometric features [3]. These features are derived from linguistic metrics and are commonly used to analyze text. However, one limitation of stylometric features is that their effectiveness tends to decrease when applied to texts that exhibit significant variations in topic.

On the other hand, deep learning systems [4] can be designed to autonomously learn neural features in a comprehensive manner. These features can be insensitive to the specific topic of the text. However, a drawback of such features is that they are generally not easily interpretable from a linguistic perspective.

In this study, we present a significant expansion of the popular, previously published ADHOMINEM method [4]. In our extended approach, we not only analyze the neural features generated by ADHOMINEM from a metric perspective but also incorporate a stylometric viewpoint. This allows for a more comprehensive extraction of features from the documents.

This paper is structured as follows: we describe our approach in Section 2, give training details in Section 3, present our evaluation results in Section 4, and discuss our conclusions and future work in Section 5.

2. Approach

We pre-define a deep learning model architecture along with its hyper-parameters and thresholds and allow the model to autonomously learn suitable features for the provided setup. This approach is in line with most deep-learning methodologies. The success of our proposed setup heavily relies on the availability of a large collection of text samples that encompass diverse variations in writing style, enabling the model to learn effectively.

We utilize a predecessor of our ADHOMINEM system [4] as a deep metric learning framework [5], together with a document-level stylometric feature extractor, to assess the similarity between two text samples. The concatenated features generated by the system are then fed into a probabilistic linear discriminant analysis (PLDA) layer [6]. This layer serves as a pairwise discriminator, conducting Bayes factor scoring within the learned metric space, thus contributing to the discriminative power of our method.

2.1. Neural extraction of linguistic embedding vectors (LEV) [5]

A text sample can be seen as a hierarchical structure composed of discrete elements arranged in a specific order. It starts with a list of sentences, where each sentence is composed of an ordered sequence of tokens. Furthermore, each token consists of an ordered sequence of characters. The primary objective of ADHOMINEM is to transform a document into a feature vector. Specifically, its Siamese topology incorporates a hierarchical neural feature extraction process that captures the stylistic attributes of a pair of documents $(\mathcal{D}_1, \mathcal{D}_2)$, which can have varying lengths. This process results in a pair of fixed-length linguistic embedding vectors (LEVs), denoted as

$$y_i = \mathcal{A}_\theta(\mathcal{D}_i) \in \mathbb{R}^{D \times 1}, \quad i \in \{1, 2\}, \tag{1}$$

where $D$ denotes the dimension of the LEVs and $\theta$ represents all the trainable parameters involved. This network is referred to as a Siamese network because both documents $\mathcal{D}_1$ and $\mathcal{D}_2$ are mapped through the exact same function $\mathcal{A}_\theta(\cdot)$.
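The Siamese idea described above can be illustrated with a deliberately simple stand-in. The sketch below is not the ADHOMINEM encoder: it replaces the hierarchical neural network with hashed character-trigram counts, and the names `encode`, `siamese_distance` and the dimension `dim=16` are hypothetical. It only shows that two documents of different lengths pass through the same function and come out as vectors of the same fixed length, which can then be compared with a distance.

```python
import math
import zlib


def encode(document, dim=16):
    """Map a variable-length text to a fixed-length, L2-normalized vector.

    Toy stand-in for a learned encoder: hashed character-trigram counts.
    """
    vec = [0.0] * dim
    for i in range(len(document) - 2):
        trigram = document[i:i + 3]
        # zlib.crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def siamese_distance(doc_1, doc_2, dim=16):
    """Encode both documents with the SAME function and return their Euclidean distance."""
    y_1 = encode(doc_1, dim)
    y_2 = encode(doc_2, dim)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_1, y_2)))
```

In a trained system the shared function would be learned so that small distances correspond to same-author pairs; here the distance only reflects surface trigram overlap.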

2.2. Stylometric features layer (SFL)

In this section, we outline the features, which are commonly utilized in previous stylometry research [3]. We selected these features from the Writeprints feature set introduced by Weerasinghe [7]. Additionally, recognizing the importance of the syntactic structure of sentences in providing informative signals to the classifier, we included POS-Tag n-grams and partial parses (or POS-Tag chunks) as part of our feature set, following the approach of previous studies [8]. Sidorov et al. [9] introduced the use of parse trees for extracting stylometric features, specifically syntactic dependency-based n-grams of POS tags. However, we employed a slightly different method to encode parse tree features, which focuses on capturing the construction of different noun and verb phrases.

Furthermore, several features were computed based on TF-IDF (term frequency-inverse document frequency) values. We used scikit-learn's TfidfVectorizer to compute the TF-IDF vectors for the documents. To exclude tokens with a document frequency below 10%, we set the min_df parameter to 0.1.
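To make the document-frequency cutoff concrete, the following standard-library sketch computes TF-IDF values for character n-grams and drops every n-gram that occurs in fewer than 10% of the documents. It is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, the IDF uses a smoothed form, log((1 + N) / (1 + df)) + 1, and no vector normalization is applied.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Yield all character n-grams of `text` with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def tfidf_char_ngrams(documents, min_df=0.1, n_min=1, n_max=6):
    """Return (vocabulary, rows) of raw-count TF times smoothed IDF.

    N-grams whose document frequency (as a proportion of documents)
    is strictly below `min_df` are excluded from the vocabulary.
    """
    counts = [Counter(char_ngrams(doc, n_min, n_max)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per n-gram
    vocab = sorted(g for g, df in doc_freq.items() if df / n_docs >= min_df)
    # Smoothed IDF (an assumption of this sketch).
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0 for g in vocab}
    rows = [[c[g] * idf[g] for g in vocab] for c in counts]
    return vocab, rows
```

For example, with 20 documents an n-gram that appears in only one of them has a document frequency of 5% and is removed, while one that appears in two or more is kept.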

• Character n-grams: TF-IDF values for character n-grams, where $1 \le n \le 6$.
• POS-Tag n-grams: TF-IDF values of POS-Tag trigrams.
• Frequency of function words: Frequencies of 179 stopwords defined in the corpus package.
• Vocab richness: computed by dividing the combined count of words that appear only once (hapax legomena) and words that appear twice (dis legomena) in the document by the total number of tokens in the document. This normalization accounts for variations in document lengths.
• POS-Tag chunks: TF-IDF values for trigrams of POS-Tag chunks. Here, we consider the tokens at the second level of our parse tree.
• NP and VP construction: TF-IDF values of each noun phrase or verb phrase expansion.
• Number of characters
• Number of words
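As a small illustration of the vocabulary richness feature described above, the sketch below counts the word types that occur exactly once and exactly twice and divides their combined count by the total number of tokens. The function name `vocab_richness` and the whitespace tokenization in the usage example are assumptions made for this example only.

```python
from collections import Counter


def vocab_richness(tokens):
    """(number of hapax legomena + number of dis legomena) / total token count."""
    if not tokens:
        return 0.0
    freq = Counter(tokens)
    hapax = sum(1 for count in freq.values() if count == 1)
    dis = sum(1 for count in freq.values() if count == 2)
    return (hapax + dis) / len(tokens)


# Usage: "the" occurs three times, the other five words once each.
tokens = "the cat saw the dog and the bird".split()
richness = vocab_richness(tokens)  # 5 / 8 = 0.625
```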

After concatenating the above features, we use truncated singular value decomposition (SVD) to reduce their dimensionality from 8708 to 10 before concatenating the result with the LEVs.

2.3. Bayes factor scoring [10]

Text samples exhibit significant variations, making it valuable to employ statistical hypothesis tests to quantify the outputs or scores generated by our algorithm. These tests aid in determining whether to accept or reject a decision. ADHOMINEM has the potential to incorporate a framework for conducting statistical hypothesis testing. Specifically, we focus on the authorship verification (AV) problem, where we are presented with the linguistic embedding vectors (LEVs) and the stylometric features layer (SFL) outputs of two documents. We concatenate them into combined layers (CLs) and then make a decision based on one of two hypotheses:

• $\mathcal{H}_s$: The two documents were written by the same person,
• $\mathcal{H}_d$: The two documents were written by two different persons.

The combined layer $y$ is decomposed into a latent writing style vector $x$ and a noise term $\epsilon$, as in Eq. (2):

$$\underbrace{y}_{\text{combined layers}} = \underbrace{x}_{\text{author's writing style}} + \underbrace{\epsilon}_{\text{noise term}} \tag{2}$$

The probability density functions for $x$ and $\epsilon$ are as shown in Eq. (3), and the conditional density of $y$ given $x$ is shown in Eq. (4):

$$p(x) = \mathcal{N}(x \mid \mu, B^{-1}), \qquad p(\epsilon) = \mathcal{N}(\epsilon \mid 0, W^{-1}) \tag{3}$$

$$p(y \mid x) = \mathcal{N}(y \mid x, W^{-1}) \tag{4}$$

Same-author pair probability: A single latent vector $x_0$ representing the author's writing style is generated from the prior $p(x)$, and both $y_i$, $i \in \{1, 2\}$, are generated from $p(y \mid x_0)$ in Eq. (4). The joint probability density function is then given by:

$$p(y_1, y_2 \mid \mathcal{H}_s) = \frac{p(y_1, y_2 \mid x_0, \mathcal{H}_s)\, p(x_0 \mid \mathcal{H}_s)}{p(x_0 \mid y_1, y_2, \mathcal{H}_s)} = \frac{p(y_1 \mid x_0)\, p(y_2 \mid x_0)\, p(x_0)}{p(x_0 \mid y_1, y_2)} \tag{5}$$

Different-authors pair probability: Two latent vectors $x_i$, $i \in \{1, 2\}$, representing the distinct writing characteristics of two different authors are generated independently from the prior distribution $p(x)$. The corresponding linguistic embedding vectors are generated from the conditional distribution $p(y_i \mid x_i)$. The joint probability density function can then be expressed as follows:

$$p(y_1, y_2 \mid \mathcal{H}_d) = p(y_1 \mid \mathcal{H}_d)\, p(y_2 \mid \mathcal{H}_d) = \frac{p(y_1 \mid x_1)\, p(x_1)}{p(x_1 \mid y_1)} \cdot \frac{p(y_2 \mid x_2)\, p(x_2)}{p(x_2 \mid y_2)} \tag{6}$$

Verification process: The probabilistic model described consists of two distinct phases: a training phase and a verification phase. During the training phase, the parameters of the Gaussian distributions in Eq. (3)-(4) are learned. These distributions capture the characteristics of the latent vectors and linguistic embedding vectors. In the verification phase, the model is utilized to determine whether the two text samples originate from the same author based on the learned parameters as shown in Eq. (7).

A higher verification score for the pair $(y_1, y_2)$ indicates higher similarity, and vice versa.
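The scoring step can be sketched numerically for the simplest possible case. The example below assumes scalar (one-dimensional) features and assumes that the verification score is the log-likelihood ratio between the same-author and different-author hypotheses; the parameter names `mu`, `between_precision` and `within_precision` and their default values are hypothetical, not learned values from the paper. Under the additive Gaussian model, each observation has variance 1/between_precision + 1/within_precision; a same-author pair additionally shares the covariance 1/between_precision, whereas a different-author pair is independent.

```python
import math


def log_normal_pdf(y, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def log_bivariate_normal_pdf(y1, y2, mean, var, cov):
    """Log density of a 2-D Gaussian with shared mean, equal variances and covariance `cov`."""
    det = var * var - cov * cov
    d1, d2 = y1 - mean, y2 - mean
    quad = (var * d1 * d1 - 2 * cov * d1 * d2 + var * d2 * d2) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def bayes_factor_score(y1, y2, mu=0.0, between_precision=1.0, within_precision=4.0):
    """Log-likelihood ratio: log p(y1, y2 | same) - log p(y1, y2 | different)."""
    between_var = 1.0 / between_precision  # variance of the latent style variable
    within_var = 1.0 / within_precision    # variance of the noise term
    total_var = between_var + within_var
    log_same = log_bivariate_normal_pdf(y1, y2, mu, total_var, between_var)
    log_diff = log_normal_pdf(y1, mu, total_var) + log_normal_pdf(y2, mu, total_var)
    return log_same - log_diff


def same_author_posterior(y1, y2, **params):
    """Posterior of the same-author hypothesis, assuming equal prior odds."""
    return 1.0 / (1.0 + math.exp(-bayes_factor_score(y1, y2, **params)))
```

With these illustrative defaults, two observations that lie close together and away from the global mean yield a positive score, while observations on opposite sides of the mean yield a negative one.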

3. Training Details

We implemented our training algorithm in Python. We performed preprocessing with a customized regular-expression function and then used spaCy for sentence boundary detection and tokenization. Given that the stylometric part of the model is fixed as described above, we fine-tuned our deep Bayesian model to achieve higher performance. However, none of the fine-tuning trials outperformed the model with the default hyper-parameters. Details are as follows:

For the final model submitted on TIRA [11], we used the entire training dataset with the above hyper-parameter settings and combined the stylometric layer outputs to train the deep Bayesian model. We took the checkpoints from epochs 8, 24, and 35 for our final three submissions.

4. Evaluation

The following table presents the experimental results on the competition dataset, which we divided into train and test sets for evaluation purposes. Using the performance metrics provided by the PAN competition, we compared the two baseline models with three of our models, each with and without the stylometric features layer (SFL): our predecessor, the deep metric learning model (DML, a model that learns directly from LEVs [5]); the uncertainty adaptation layer model (UAL, which models the noise behavior [12]); and the Bayes factor scoring model (BFS).

Method
• Naive, distance-based baseline
• Text-compression-based baseline
• DML without SFL
• UAL without SFL
• BFS without SFL
• DML with SFL
• UAL with SFL
• BFS with SFL
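The text above does not name the individual metrics. As an illustration only, the sketch below implements c@1, one of the measures commonly reported for PAN authorship verification tasks, under the assumption that a score of exactly 0.5 counts as an unanswered problem. The function name `c_at_1` and the example values are hypothetical and are not results from this paper.

```python
def c_at_1(true_labels, scores, threshold=0.5):
    """c@1 = (n_correct + n_unanswered * n_correct / n) / n.

    A score equal to `threshold` is treated as unanswered; scores above it
    predict "same author" (label 1), scores below it "different authors" (label 0).
    """
    n = len(true_labels)
    if n == 0:
        raise ValueError("c@1 is undefined for an empty evaluation set")
    n_correct = 0
    n_unanswered = 0
    for label, score in zip(true_labels, scores):
        if score == threshold:
            n_unanswered += 1
        elif (score > threshold) == bool(label):
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


# Usage with made-up values: two correct answers, one unanswered, one wrong.
example = c_at_1([1, 0, 1, 0], [0.9, 0.1, 0.5, 0.8])  # (2 + 1 * 2 / 4) / 4 = 0.625
```

The measure rewards leaving difficult pairs unanswered over answering them incorrectly, which is why a model may output exactly 0.5 for uncertain cases.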

5. Conclusions

We have introduced a novel approach to authorship verification (AV) that combines neural feature extraction and stylometric features with statistical modeling. The observed performance improvements affirm the value of the proposed enhancements in the ADHOMINEM model, emphasizing the significance of the feature selection technique and the utilization of stylometric features for the authorship verification task.

In AV, numerous factors introduce variability, such as topic, genre, text length and text type, which can negatively impact the performance of the system. However, we believe that there is significant potential for further improvements by incorporating compensation techniques to address these aspects in future challenges.

Acknowledgments

We thank the PAN 2023 [13] organizers for arranging this task and helping us through the submission process. We also thank the reviewers for their helpful comments and feedback. Our work was supported by Pindrop.

References

[1] E. Stamatatos, K. Kredens, P. Pęzik, A. Heini, J. Bevendorff, M. Potthast, B. Stein, Overview of the Authorship Verification Task at PAN 2023, in: CLEF 2023 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2023.

[2] Ehrhardt, 7. Authorship attribution analysis, De Gruyter Mouton, Berlin, Boston, 2018, pp. 169-200. URL: https://doi.org/10.1515/9781614514664-010. doi:10.1515/9781614514664-010.

[3] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology (2009).

[4] B. T. Boenninghoff, Hessler, D. Kolossa, R. M. Nickel, Explainable authorship verification in social media via attention-based similarity learning, CoRR abs/1910.08144 (2019). URL: http://arxiv.org/abs/1910.08144. arXiv:1910.08144.

[5] B. T. Boenninghoff, R. M. Nickel, Zeiler, D. Kolossa, Similarity learning for authorship verification in social media, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019. URL: https://doi.org/10.1109%2Ficassp.2019.8683405. doi:10.1109/icassp.2019.8683405.

[6] Cumani, Brümmer, Burget, Laface, Plchot, Vasilakakis, Pairwise discriminative speaker verification in the i-vector space, IEEE Transactions on Audio, Speech, and Language Processing 21 (2013) 1217-1227. doi:10.1109/TASL.2013.2245655.

[7] Weerasinghe, Greenstadt, Feature Vector Difference based Neural Network and Logistic Regression Models for Authorship Verification-Notebook for PAN at CLEF 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020.

[8] Kestemont, Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the cross-domain authorship verification task at PAN 2020, in: Conference and Labs of the Evaluation Forum, 2020.

[9] Sidorov, Castillo, E. Stamatatos, Gelbukh, Chanona-Hernández, Syntactic n-grams as machine learning features for natural language processing, Expert Systems with Applications: An International Journal 41 (2014) 853-860. doi:10.1016/j.eswa.2013.08.015.

[10] B. T. Boenninghoff, Rupp, R. M. Nickel, D. Kolossa, Deep Bayes factor scoring for authorship verification, CoRR abs/2008.10105 (2020). URL: https://arxiv.org/abs/2008.10105. arXiv:2008.10105.

[11] M. Fröbe, M. Wiegmann, Kolyada, Grahm, Elstner, Loebe, Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.

[12] B. T. Boenninghoff, D. Kolossa, R. M. Nickel, Self-calibrating neural-probabilistic model for authorship verification under covariate shift, CoRR abs/2106.11196 (2021). URL: https://arxiv.org/abs/2106.11196. arXiv:2106.11196.

[13] J. Bevendorff, I. Borrego-Obrador, M. Chinea-Ríos, M. Franco-Salvador, M. Fröbe, A. Heini, K. Kredens, M. Mayerl, P. Pęzik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2023: Authorship Verification, Multi-Author Writing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, A. G. Stefanos Vrochidis, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), Lecture Notes in Computer Science, Springer, 2023.