A Content Spectral-Based Analysis for Authorship
Verification
Notebook for PAN at CLEF 2022

Melesio Crespo-Sanchez1 , Helena Gómez-Adorno2 , Ivan Lopez-Arevalo1 ,
Edwin Aldana-Bobadilla1 , Karla Salas-Jimenez3 and Jorge Cortes-Lopez3
1 Centro de Investigación y de Estudios Avanzados del I.P.N. Unidad Tamaulipas, Victoria 87130, México
2 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, UNAM, Ciudad de México 04510, México
3 Facultad de Ciencias, UNAM, Ciudad de México 04510, México


                                         Abstract
                                         Authorship verification aims at determining whether the same author produced a given pair of texts.
                                         This task involves analyzing the documents' essential features: the lexical content, i.e., the vocabulary
                                         used; the syntactic content, reflected in how the author combines the words of that vocabulary
                                         following grammar rules; and the semantic content of the documents. This work presents a content
                                         spectral-based analysis approach, using neural network techniques, for the authorship verification
                                         task at PAN at CLEF 2022.

                                         Keywords
                                         authorship verification, content spectral-based analysis, neural networks, PAN at CLEF

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Emails: melesio.crespo@cinvestav.mx (M. Crespo-Sanchez); helena.gomez@iimas.unam.mx (H. Gómez-Adorno); ilopez@cinvestav.mx (I. Lopez-Arevalo); edwyn.aldana@cinvestav.mx (E. Aldana-Bobadilla); karla_dsj@ciencias.unam.mx (K. Salas-Jimenez); kokofrank@ciencias.unam.mx (J. Cortes-Lopez)
Web: https://github.com/helenpy (H. Gómez-Adorno)
ORCID: 0000-0001-5688-5352 (M. Crespo-Sanchez); 0000-0002-6966-9912 (H. Gómez-Adorno); 0000-0002-7464-8438 (I. Lopez-Arevalo); 0000-0001-8315-1813 (E. Aldana-Bobadilla)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)


1. Introduction
PAN (https://pan.webis.de/) at CLEF (https://clef2022.clef-initiative.eu/) is a series of scientific
events and shared tasks on digital text forensics and stylometry, covering Style Change Detection,
Profiling Irony and Stereotype Spreaders on Twitter, and Authorship Verification [1]. In this paper,
we present an approach for the authorship verification task at PAN 2022 [2], whose description is
as follows: given a pair of texts belonging to different discourse types, the challenge is to determine
whether or not the same author wrote both. From a machine learning perspective, this is a binary
classification task.
   Nowadays, we can find different approaches to solving the authorship verification task, such as
machine learning-based [3], distance-based [4], or deep learning-based approaches [5]. A common
factor among all these techniques is the text representation in a vector space, where each vector
abstracts relevant features of the text. In this space, we can find clusters of
vectors that may denote certain classes of objects with similar features. Text representations
should consider lexical, syntactic, and semantic components [6]: the lexical component is associated
with the vocabulary used in the texts; the syntactic component with how words structure a text,
which may denote a writing style (relevant for authorship verification); and the semantic component
with the main idea conveyed in a text.
   In this work, we tackle the authorship verification task by transforming texts into a content
spectral-based representation [7], which takes into account the previously mentioned text
components; this representation is then used by a machine learning algorithm to detect whether a
pair of documents belongs to the same author.
   The rest of this paper is structured as follows. Section 2 synthesizes some works related to our
approach. Section 3 describes the dataset used for the task and presents the training, validation,
and test partitions we generated to train the models. Section 4 describes the methodology for
data transformation and classification. In Section 5, the experiment configuration and results
are reported. Section 6 concludes the paper and outlines future work.


2. Previous Work
Authorship verification remains an area of active exploration, with strategies that keep evolving
across different problems and results [8]. Deep learning approaches have been used to solve
authorship verification problems in recent years; some examples are well-known pre-trained models
such as transformers [9, 10, 11]. In PAN 2020, [5] introduced a Siamese network that learns the
difference between two documents with a fully connected layer using n-grams and a residual network.
   Other approaches use stylometric features, together with lexical, syntactic, and semantic features,
as input for classical machine learning algorithms [12]. Sometimes it is better to perform manual
feature extraction to train classification models. In this work, we propose to explore features from
different linguistic levels of language description (lexical, syntactic, and semantic) and to represent
the documents in a combined vector space.


3. Dataset
For this shared task, the dataset to be analyzed is the one provided by PAN
(https://pan.webis.de/data.html). The following subsections review the dataset and explain its
division into the training, validation, and test partitions that we use to train and evaluate our
authorship verification model.

3.1. Data Review
The PAN 2022 authorship verification shared task organizers provided a training dataset containing
12,264 problem instances. Each problem corresponds to a pair of texts, their discourse types, and a
label that identifies whether the same author wrote both texts. Below, we show an example of an
instance's structure:

{"id": "instance id", "discourse_type": ["essay", "email"], "pair": ["Text 1...", "Text 2..."]}


   We identified a total of 1046 unique documents. The unique discourse type classes identified
for these texts were email, essay, memo, and text_message. The distribution of discourse-type
classes for these texts is shown in Table 1.

Table 1
Distribution of texts per discourse type.
                                          Discourse type    Total texts
                                          email                  507
                                          essay                   93
                                          memo                    56
                                          text_message           390


   We identified a total of 56 unique authors in the training dataset, with a balanced number of
instances: 6132 positive (documents written by the same author) and 6132 negative (documents
written by different authors).

3.2. Dataset Partitions
After analyzing the training dataset provided by the organizers, we divided it into three partitions,
60% for training, 10% for validation, and 30% for testing, to train and evaluate our neural
network approach. With the aim of training and evaluating our models on different authors,
the partitioning needs to be author-disjoint, i.e., the intersection of the author sets of the
training, validation, and testing partitions should be empty.
   To accomplish this, we ordered the authors by the number of texts written, so that the authors
with the highest numbers of texts were included in the training partition. Then we deleted pairs
whose authors appeared in different partitions (about 5514 instances). Also, to address the problem
of small partitions, we removed texts with fewer than 500 tokens. Table 2 shows the distribution
of the obtained partitions. Nevertheless, these partitions are unbalanced, which could cause a bias
towards one of the two classes.

Table 2
Initial partition distribution.
                 Partition        Total instances   Positive instances    Negative instances
                  Train                5889                5469                  420
                Validation             287                 274                   13
                   Test                 411                 389                   22


   Additionally, we added new instances of document pairs written by the same author (positive)
and by different authors (negative) to balance the partitions. For this, we applied the following
process: let 𝐴 and 𝐵 be the sets of unique documents written by two distinct authors in the corpus.
Positive (𝑃) and negative (𝑁) instances were obtained via cartesian products, 𝑃 = 𝐴 × 𝐴 and
𝑁 = 𝐴 × 𝐵, respectively. Then, we randomly selected positive and negative instances from the 𝑃
and 𝑁 sets, respectively, to balance the training, validation, and test partitions; a sketch of this
pairing step is given after Table 3. Table 3 shows the distribution of the final partitions with the
total number of instances and how many of these are positive and negative cases.

Table 3
Final partition distribution.
                 Partition      Total instances   Positive instances   Negative instances
                  Train             15,732              7866                 7866
                Validation            754               377                  377
                   Test              1070                535                  535
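   A minimal sketch of this pair-generation step, assuming documents are grouped per author in a
dict; the names docs_by_author, n_pairs, and the fixed seed are illustrative rather than taken from
the paper, and identical-document pairs are excluded from 𝑃.

import itertools
import random

def make_pairs(docs_by_author, n_pairs, seed=0):
    # P = A x A: ordered pairs of distinct documents by the same author.
    positives = [(x, y, 1)
                 for docs in docs_by_author.values()
                 for x, y in itertools.permutations(docs, 2)]
    # N = A x B: pairs of documents written by two different authors.
    negatives = [(x, y, 0)
                 for a, b in itertools.permutations(docs_by_author, 2)
                 for x in docs_by_author[a] for y in docs_by_author[b]]
    rng = random.Random(seed)
    # Assumes enough candidate pairs exist in each set.
    return rng.sample(positives, n_pairs) + rng.sample(negatives, n_pairs)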




4. Methodology
Several techniques can be useful in the machine learning domain to solve the authorship
verification problem. Among this repertoire of techniques, artificial neural networks are some of
the most popular [13]. In this work, we used a multilayer perceptron network as a classification
algorithm to determine whether or not each pair of texts belongs to the same author.
   An adequate data transformation is required to train a machine learning model, i.e., to obtain a
vector space representation of the objects in the training set. Such a representation must hold
features that allow the machine learning algorithm to perform a category separation in the
vector space. Each category must include elements that share common characteristics, denoting
the problem's classes of interest. Given the data type in this problem (text) and its nature, we
assume that the representation of the different texts in the dataset must abstract elements such
as the vocabulary, the writing style, and the main idea of the content. These elements correspond
to the text's lexical, syntactic, and semantic components.
   For this reason, we opted to use the spectral text representation proposed in [7] as a data
transformation method. To obtain a representation of each document in the dataset, we identified
a set of unique texts in the PAN dataset. From now on, we refer to this set as 𝐷. Figure 1
illustrates the transformation method used, which is composed of four main stages. Each stage is
described below.

4.1. Text Pre-Processing Stage
The pre-processing techniques applied to 𝐷 depend on the layer we work on (lexical, syntactical,
or semantic layer). Unlike what is mentioned in [7], the only common pre-processing tasks
applied to 𝐷 for all layers are converting all texts to lowercase and tokenizing them for all layers.
We did not remove any stopwords or punctuation symbols to preserve as much information
as possible. We also preserved labels with named entities in the form of  (e.g. streets,
people, organizations, etc.) in the tokenization process. The syntactic layer applies a part-of-
speech tagging (POS tagging) process. From this stage, we obtained three new versions of 𝐷:
𝐷𝑙𝑒𝑥 , 𝐷𝑠𝑦𝑛 , and 𝐷𝑠𝑒𝑚 respectively, one for each text component.
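   A minimal sketch of this stage using NLTK, assuming its default word tokenizer and Penn
Treebank POS tagset; the helper name preprocess is ours, and the named-entity placeholder
handling of the original pipeline is not reproduced here.

import nltk
from nltk import word_tokenize, pos_tag

nltk.download("punkt")                        # tokenizer model
nltk.download("averaged_perceptron_tagger")   # POS tagger model

def preprocess(text):
    # Common steps for all layers: lowercase and tokenize.
    tokens = word_tokenize(text.lower())
    d_lex = tokens                               # lexical view
    d_syn = [tag for _, tag in pos_tag(tokens)]  # syntactic view: POS sequence
    d_sem = tokens                               # semantic view, fed to Doc2Vec later
    return d_lex, d_syn, d_sem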
Figure 1: Spectral text representation method.


4.2. Feature Extraction Stage
Given the pre-processed sets 𝐷𝑙𝑒𝑥, 𝐷𝑠𝑦𝑛, and 𝐷𝑠𝑒𝑚, the aim at this stage is to extract feature
vectors, one per text document for each layer. That said, we obtain three feature vectors for the
same text (⃗𝑥𝑙𝑒𝑥, ⃗𝑥𝑠𝑦𝑛, and ⃗𝑥𝑠𝑒𝑚), corresponding to each of the text components. Each type of
vector must contain features related to the component in question. The extraction of each of these
types of vectors is as follows (a combined sketch is given after the three layer descriptions):
   - Lexical layer: For each text in 𝐷𝑙𝑒𝑥, we obtain a vector ⃗𝑥 = [𝐼(𝑤0), 𝐼(𝑤1), . . . , 𝐼(𝑤𝑗)],
where each column corresponds to a word in the vocabulary. The value assigned to each element
of ⃗𝑥 is given by the Shannon information content of an outcome 𝑤𝑗 [14], as described in Equation 1:

                                      𝐼(𝑤𝑗) = −log2(𝑝(𝑤𝑗))                                    (1)

where 𝐼(𝑤𝑗) is the amount of information that the word 𝑤𝑗 contributes to the text and 𝑝(𝑤𝑗) is the
probability of 𝑤𝑗. With this approach, infrequent words are emphasized according to the amount
of information they contribute to the text. In this sense, we consider all words, including
punctuation marks and stop words, as part of the vocabulary. We obtain the set of feature
vectors 𝑋𝑙𝑒𝑥 from this process.
   - Syntactic layer: To extract syntactic features, we use the well-known Doc2Vec algorithm
[15]. Although this algorithm is commonly used to extract semantic content vectors, here we
apply it to the POS tag sequences obtained from the original texts in the pre-processing stage.
In this way, the Doc2Vec algorithm is expected to capture syntactic information about the
content rather than semantics, which can denote the writing style of a given author. We obtain
the set of vectors 𝑋𝑠𝑦𝑛 from 𝐷𝑠𝑦𝑛.
   - Semantic layer: In this layer, we want to obtain feature vectors from 𝐷𝑠𝑒𝑚 that capture
semantic information. For this, we resort to the Doc2Vec algorithm once again to extract the
corresponding feature vector set 𝑋𝑠𝑒𝑚 for this text component.
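   A combined sketch of the three extractors described above. Estimating 𝑝(𝑤𝑗) from corpus-wide
relative frequencies and setting absent words to 0.0 are our assumptions; the Doc2Vec vector size
(300) and window (5) follow Section 5, with Gensim as the implementation named there.

import math
from collections import Counter
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def lexical_vectors(token_docs):
    # One information-content vector per document over the corpus vocabulary.
    counts = Counter(tok for doc in token_docs for tok in doc)
    total = sum(counts.values())
    info = {w: -math.log2(c / total) for w, c in counts.items()}  # Equation 1
    vocab = sorted(info)
    return [[info[w] if w in present else 0.0 for w in vocab]
            for present in (set(doc) for doc in token_docs)]

def doc2vec_vectors(token_docs, size=300, window=5):
    # POS tag sequences yield X_syn; word token lists yield X_sem.
    tagged = [TaggedDocument(doc, [i]) for i, doc in enumerate(token_docs)]
    model = Doc2Vec(tagged, vector_size=size, window=window, min_count=1)
    return [model.dv[i] for i in range(len(token_docs))]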
4.3. Unified Space Mapping Stage
The sets of extracted feature vectors 𝑋𝑙𝑒𝑥, 𝑋𝑠𝑦𝑛, and 𝑋𝑠𝑒𝑚 can have different numbers of
dimensions. At this stage, we make use of Self-Organizing Maps (SOM) [16] to transfer vectors
with different numbers of dimensions to a space with the same dimensions, where their similarity
is preserved. This type of neural network is well known for mimicking the distribution of its
training feature vectors. A SOM has a single output layer known as a lattice, typically a
two-dimensional matrix. Each neuron in this lattice has an associated weight vector with the
same number of dimensions as the training vectors. For this purpose, we train a SOM model for
each set of vectors 𝑋𝑙𝑒𝑥, 𝑋𝑠𝑦𝑛, and 𝑋𝑠𝑒𝑚. We use the trained SOM models to obtain projections
of the 𝑋 vectors by applying the activation function described in Equation 2 between each feature
vector ⃗𝑥 and each of the weight vectors 𝑤⃗ associated with every neuron of the SOM output lattice.
                          𝑓(⃗𝑥, 𝑤⃗) = 1 / ( ∑_{𝑖=0}^{𝑛} |𝑥𝑖 − 𝑤𝑖|² )^{1/2}                      (2)

where ⃗𝑥 is a feature vector of a given text in its corresponding component layer, 𝑤⃗ is the weight
vector of a given neuron in the SOM lattice, and 𝑛 is the number of dimensions of ⃗𝑥.
   By applying the above activation function, we generate a feature matrix on the SOM lattice
that denotes a spectrum of the text content in a unified space, one spectrum for each of the three
text components per document; a sketch of this projection follows.
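   A minimal numpy sketch of this projection, assuming a lattice of already-trained weight vectors
(the 20 × 20 size follows Section 5; SOM training itself is omitted, and the eps guard against
division by zero is our addition).

import numpy as np

def project(x, lattice, eps=1e-12):
    # lattice has shape (20, 20, n): one weight vector per neuron.
    # Euclidean distance between x and every neuron's weight vector.
    dist = np.sqrt(((lattice - x) ** 2).sum(axis=-1))
    # Equation 2: inverse distance, so closer neurons activate more strongly.
    return 1.0 / (dist + eps)   # (20, 20) activation map: the text's spectrum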

4.4. Layer Consolidation
This final stage of the text transformation simply takes the three spectra of each text and
consolidates them into a single three-layer text representation containing lexical, syntactic,
and semantic features of the content.

4.5. Authorship Verification
Our proposal to address the authorship verification problem is illustrated in Figure 2. It consists
of transforming each text in the dataset to its corresponding spectral representation. For each
pair of texts in a problem instance, a subtraction is performed between the layers of the two
spectra. Note that the subtraction operation is not commutative, and thus the sign of the
dissimilarity between documents changes depending on their order (𝐴 − 𝐵 or 𝐵 − 𝐴). Our
proposal showed better results when the sign, or direction, of such dissimilarity was considered.
Once 𝐴 − 𝐵 is calculated, the resulting matrices are flattened to form a single feature vector,
sketched below, that feeds a multilayer perceptron neural network, which determines whether the
content of the two texts shares lexical, syntactic, and semantic features that may denote a
similarity in their authorship.
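   A sketch of this differencing step, assuming the consolidation stage stacks each document's
three 20 × 20 spectra into one (3, 20, 20) array (our assumption about the data layout).

import numpy as np

def pair_features(spec_a, spec_b):
    # Signed, layer-wise difference; the subtraction is deliberately
    # not symmetric, so the direction of the dissimilarity is kept.
    return (spec_a - spec_b).reshape(-1)   # 3 * 20 * 20 = 1200 features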
Figure 2: Authorship verification proposal.


5. Experiments and Results
Following the methodology described in Section 4, we transformed all unique documents into
matrices of size 20 × 20 by training SOM models for each layer. The SOM models were trained
with a learning rate of 0.01 and 1000 epochs each. We used the Doc2Vec algorithm to obtain
vectors of size 300 with a window of 5 words for the syntactic and semantic layers.
   All the implementations were made using Keras (keras.io) and TensorFlow (www.tensorflow.org).
Also, we used the POS tagger implemented in NLTK (www.nltk.org) and the Doc2Vec implementation
of Gensim (radimrehurek.com/gensim/models/doc2vec.html).
   After the transformation of the texts was done, we used the merging strategy shown in Figure 2
to train a multilayer perceptron neural network with the following architecture (a sketch in code
follows the list):

    • A first dense layer with 600 neurons.
    • A ReLU activation function.
    • An L2 kernel regularizer.
    • A first dropout layer with a rate of 0.4.
    • A second dense layer with 300 neurons.
    • A ReLU activation function.
    • An L2 kernel regularizer.
    • A second dropout layer with a rate of 0.4.
    • A third dense layer with 1 neuron.
    • A sigmoid activation function.
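   A minimal Keras sketch of this architecture; the loss, optimizer, and epoch count are taken
from the paragraph below, and the input size of 1200 assumes all three 20 × 20 spectral layers
are used.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(1200,)),   # flattened A - B differences
    layers.Dense(600, activation="relu", kernel_regularizer=regularizers.l2()),
    layers.Dropout(0.4),
    layers.Dense(300, activation="relu", kernel_regularizer=regularizers.l2()),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))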

   Given that this task is a binary classification problem, we used binary cross-entropy as the
loss function to guide the algorithm during the training process. This model was trained for a
total of 100 epochs using the Adam optimizer. For completeness, we trained this model with
combinations of 1, 2, and 3 spectral layers from the texts. We used the test partition to validate
the training results with the metrics established in the shared task. Table 4 summarizes the
results of these experiments, where column auc is the conventional area under the ROC curve [17],
c@1 is a variant of conventional accuracy that rewards leaving difficult problems unanswered [18],
f_05_u is a measure that emphasizes correctly deciding same-author cases [19], F1 is the harmonic
mean of precision and recall [20], brier is the complement of the Brier score [21], and overall is
the average of all these metrics.

Table 4
Results of the classification model using different SOM layers and its combinations on our testing set.
              SOM layers               auc     c@1     f_05_u     F1     brier    overall
              all                    0.592    0.593    0.596    0.611    0.756    0.630
              lexical-syntactical    0.596    0.584    0.584    0.581    0.756    0.620
              syntactical-semantic   0.512    0.509    0.471    0.420    0.744    0.531
              lexical-semantic       0.597    0.593    0.593    0.598    0.755    0.627
              lexical                0.594    0.565    0.556    0.523    0.754    0.598
              syntactical            0.491    0.500    0.556    0.667    0.750    0.593
              semantic               0.493    0.503    0.557    0.668    0.750    0.594

   From the previous table, we can highlight that the best results on average are obtained when
using all three layers of the text representation. This suggests that using all the information
provided by the different layers together allows the classifier to better discern between the
positive and negative cases of the problem. Among the two-layer combinations, using the lexical
and semantic spectra exhibits the best results. This indicates that the vocabulary and the main
idea support the decision-making more than the syntax of the content, suggesting that the topic
discussed in the text is as important as the vocabulary used. Finally, when using a single layer,
the best results on average are obtained using only the lexical spectrum. This may be because the
vocabulary used by the different authors in the dataset is crucial to determining the separability
of the problem. Given the obtained results, we selected the model trained with all layers of the
text representation to deploy on the TIRA platform for evaluation on the task's test corpus [22].


6. Conclusions
In this paper, we addressed the authorship verification problem by using a content spectral-based
representation to train a multilayer perceptron neural network that classifies whether the same
author produced a given pair of texts. Our classification model achieved better results when
using all text content spectra. However, even with this information, the highest average score
is 0.630. We think we can improve this performance if additional information, such as the discourse
type of the texts, is included as a feature in the model. This variable was not used in our
experiments; in future work, using such information could help to ensure better separability of
the classes.
Acknowledgments
This work has been carried out with the support of CONACyT project CB A1-S-27780, DGAPA-
UNAM PAPIIT projects TA400121 and TA101722, and CONACyT scholarship 709733. The
authors thank CONACyT for the computing resources provided through the Deep Learning
Platform for Language Technologies of the INAOE Supercomputing Laboratory. We thank Eng.
Roman Osorio for his help in managing the project's students.


References
 [1] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl,
     R. Ortega-Bueno, P. Pęzik, M. Potthast, et al., Overview of PAN 2022: Authorship ver-
     ification, profiling irony and stereotype spreaders, style change detection, and trigger
     detection, in: European Conference on Information Retrieval, Springer, 2022, pp. 331–338.
 [2] E. Stamatatos, M. Kestemont, K. Kredens, P. Pezik, A. Heini, J. Bevendorff, M. Potthast,
     B. Stein, Overview of the Authorship Verification Task at PAN 2022, in: CLEF 2022 Labs
     and Workshops, Notebook Papers, CEUR-WS.org, 2022.
 [3] J. Weerasinghe, R. Singh, R. Greenstadt, Feature vector difference based authorship
     verification for open-world settings., in: CLEF (Working Notes), 2021, pp. 2201–2207.
 [4] M. Pinzhakova, T. Yagel, J. Rabinovits, Feature similarity-based regression models for
     authorship verification., in: CLEF (Working Notes), 2021, pp. 2108–2117.
 [5] E. Araujo-Pino, H. Gómez-Adorno, G. F. Pineda, Siamese network applied to authorship
     verification., in: CLEF (Working Notes), 2020.
 [6] G. Verma, B. V. Srinivasan, A lexical, syntactic, and semantic perspective for understanding
     style in text, arXiv preprint arXiv:1909.08349 (2019).
 [7] M. Crespo-Sanchez, I. Lopez-Arevalo, E. Aldana-Bobadilla, A. Molina-Villegas, A content
     spectral-based text representation, Journal of Intelligent & Fuzzy Systems (2022) 1–12.
 [8] P. Juola, Authorship attribution, volume 3, Now Publishers Inc, 2008.
 [9] Z. Peng, L. Kong, Z. Zhang, Z. Han, X. Sun, Encoding text information by pre-trained
     model for authorship verification., in: CLEF (Working Notes), 2021, pp. 2103–2107.
[10] X. Miao, H. Qi, Z. Zhang, G. Cao, R. Lin, W. Lin, Dual neural network classification based
     on BERT feature extraction for authorship verification., in: CLEF (Working Notes), 2021, pp.
     2069–2072.
[11] R. Futrzynski, Author classification as pre-training for pairwise authorship verification.,
     in: CLEF (Working Notes), 2021, pp. 1945–1952.
[12] A. Menta, A. Garcia-Serrano, Authorship verification with neural networks via stylometric
     feature concatenation., in: CLEF, 2021.
[13] J. Tyo, B. Dhingra, Z. C. Lipton, Siamese BERT for authorship verification., in: CLEF
     (Working Notes), 2021, pp. 2169–2177.
[14] D. J. C. MacKay, Information theory, inference and learning algorithms, Cambridge
     University Press, 2003.
[15] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International
     conference on machine learning, PMLR, 2014, pp. 1188–1196.
[16] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (1990) 1464–1480.
[17] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning
     algorithms, Pattern Recognition 30 (1997) 1145–1159.
[18] A. Peñas, A. Rodrigo, A simple measure to assess non-response, in: Proceedings of the 49th
     Annual Meeting of the Association for Computational Linguistics, 2011.
[19] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in:
     Proceedings of the 2019 Conference of the North American Chapter of the Association for
     Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
     Papers), 2019, pp. 654–659.
[20] B. Wang, C. Li, V. Pavlu, J. Aslam, A pipeline for optimizing f1-measure in multi-label
     text classification, in: 2018 17th IEEE International Conference on Machine Learning and
     Applications (ICMLA), IEEE, 2018, pp. 913–918.
[21] G. Blattenberger, F. Lad, Separating the Brier score into calibration and refinement compo-
     nents: A graphical exposition, The American Statistician 39 (1985) 26–32.
[22] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
     in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The
     Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/
      978-3-030-22948-1_5.