Graph-based Siamese Network for Authorship Verification
Notebook for PAN at CLEF 2021

Daniel Embarcadero-Ruiz1, Helena Gómez-Adorno2, Ivan Reyes-Hernández3, Alexis García4 and Alberto Embarcadero-Ruiz1

1 Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional Autónoma de México, Ciudad de México, México
2 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Ciudad de México, México
3 Facultad de Ciencias, Universidad Nacional Autónoma de México, Ciudad de México, México
4 Facultad de Estudios Superiores Acatlán, Universidad Nacional Autónoma de México, Ciudad de México, México

Abstract
Authorship verification is the task of determining whether or not two texts were written by the same author. The PAN@CLEF 2021 Authorship Verification challenge [1] requires solving the task on a cross-topic and open-set collection of fanfiction texts. We propose a novel approach that extracts features from a text by first modeling it as a graph and then using a graph neural network to extract relevant features. We use a Siamese network architecture because it has shown good generalization to unseen classes in previous work on verification tasks.

Keywords
Authorship verification, Graph neural networks, Text graphs

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: danielembru@gmail.com (D. Embarcadero-Ruiz); helena.gomez@iimas.unam.mx (H. Gómez-Adorno); ivanreyes@ciencias.unam.mx (I. Reyes-Hernández); 311115781@pcpuma.acatlan.unam.mx (A. García); albertoemru@gmail.com (A. Embarcadero-Ruiz)
Web: https://helenagomez-adorno.github.io (H. Gómez-Adorno)
ORCID: 0000-0003-1904-1606 (H. Gómez-Adorno)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Authorship verification is the task of determining whether or not two texts were written by the same author. To solve this task, we propose to represent each text as a graph and let a graph neural network extract relevant features from it. Our motivation is that the graph structure provides additional information contained in the text that cannot be obtained when the text is processed in the traditional sequential manner. Our model uses a Siamese network architecture [2]. This network has two input layers, which receive the samples to compare (in our case, text samples), and one output layer whose value corresponds to the similarity between the two inputs.

2. Related work

The problem of authorship verification was first tackled with linguistic approaches and eventually with statistical and computational methods [3]. In general, the most common strategy is to extract features from the text and use them to train a classification algorithm, which can be based either on supervised learning or on similarity. The features extracted from the texts are usually lexical (e.g., bag of words, vocabulary richness, misspelled words, character n-grams), syntactic (e.g., POS tags, chunks, sentence segmentation), and semantic (e.g., dependency trees, synonyms) [4]. Classifiers used for this problem include support vector machines, decision trees, discriminant analysis, neural networks, and genetic algorithms [5].

The use of a heterogeneous classifier that combines independent classifiers, each one with a different approach, has achieved good results, usually better than those obtained with a single classifier [5, 6]. Before 2015, evaluation was mainly carried out on datasets in which training and testing documents shared the same topic and genre. It has been observed that in cross-topic datasets the performance of the traditional approaches decreases [6].

There are several graph-based representations used to model documents. The general approach consists of identifying relevant elements in the text (e.g., words, sentences, paragraphs) and considering them as nodes in the graph; meaningful relations between these elements are then added as edges. Traditionally, the elements used as nodes in the graph are words, sentences, paragraphs, documents, and concepts. To define the edges, syntactic and semantic relations as well as statistical counts are usually used [7].

In [8], the authors introduced different graph representations for authorship analysis; in particular, they suggested an enriched co-occurrence graph for authorship verification. Another graph-based approach for document understanding is presented in [9]. This approach considers different levels of linguistic analysis (lexical, morphological, syntactic, and semantic) in order to build a graph representation of a given document. The authors also introduce a technique for extracting useful text patterns based on shortest paths.

Siamese Neural Networks (SNN) were first presented by Bromley et al. [2] to solve the problem of signature verification. Their formulation defines two separate sub-networks that act on each input pattern to extract features; the cosine of the angle between the two feature vectors is then used as the distance between the compared instances. Koch et al. [10] proposed a Siamese convolutional network for one-shot image recognition. Instead of using a contrastive loss as the objective, they placed a fully connected layer followed by a sigmoid function on top of the Siamese network to obtain a prediction, and optimized a cross-entropy loss with L2 regularization. Last year, two approaches used SNNs to solve the PAN Authorship Verification task [11]. Boenninghoff et al. [12] used an architecture of LSTMs with attention coefficients to extract first sentence embeddings and then a document embedding. Araujo-Pino et al. [13] used an adapted residual network with n-gram vectors as input.

3. Authorship Verification Dataset at PAN 2021

The dataset provided by the PAN@CLEF organization [14] was available in two sizes: the small dataset has 52,601 predefined pairs of texts and the large one has 275,565 pairs. Each problem is composed of two fanfiction texts; a fanfiction is an original story written by a fan that extends the universe of a fandom topic. All texts are written in English, and each one has an average of 21,400 characters, 4,950 tokens, and 345 sentences. This year, the Authorship Verification task [1] focused on a cross-topic and open-set scenario: the test dataset contains texts from authors and topics never seen in the training dataset.

We submitted two models, both with the same architecture but trained on the small or the large dataset. To develop our model, we conducted several experiments, and for these we needed to split the dataset into three parts: train, validation, and test. We trained our model on the train split, used the validation split to calibrate the hyperparameters, and used the test split to get a reference score for the model. We did not use any of the samples in the test split to calibrate our model, so the score on this set tells us about the generalization ability of the model.

Our splits were made over the same fixed pairs given by the dataset. We made these splits author disjoint, that is, no text in one split has the same author as any text in a different split. At first we also tried to make the splits topic disjoint, but this was difficult to achieve because of the natural distribution of topics and authors over the pairs of texts. All the splits were defined with a balanced proportion of true and false problems. After developing and calibrating our models, we deployed them on TIRA [15] for testing.
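As an illustration, the following minimal sketch produces an author-disjoint split over predefined pairs. It assumes each problem carries the author ids of its two texts, as in the PAN truth files; pairs whose authors fall into different splits are simply discarded. It is a sketch under these assumptions, not our exact splitting procedure.

    import random

    def author_disjoint_split(problems, ratios=(0.8, 0.1, 0.1), seed=0):
        """Assign every author to one split at random, then keep only the
        pairs whose two authors landed in the same split, so no author
        appears in more than one split. Cross-split pairs are dropped."""
        rng = random.Random(seed)
        bounds = (ratios[0], ratios[0] + ratios[1])
        authors = {a for p in problems for a in p["authors"]}
        split_of = {}
        for a in authors:
            r = rng.random()
            split_of[a] = 0 if r < bounds[0] else (1 if r < bounds[1] else 2)
        train, val, test = [], [], []
        splits = (train, val, test)
        for p in problems:
            a1, a2 = p["authors"]  # assumed PAN truth-file field
            if split_of[a1] == split_of[a2]:
                splits[split_of[a1]].append(p)
        return train, val, test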
4. Modeling Texts as Graphs

We represent each text as a graph that attempts to capture the relationships between the words and POS labels in the text. To illustrate the process, we use the following text as a running example:

Momo, also known as The Grey Gentlemen or The Men in Grey,

First, each text is preprocessed as follows:

1. Substitute non-ASCII characters with an ASCII equivalent (we employed the unidecode package, https://github.com/avian2/unidecode).
2. Tokenize the text and obtain the POS label of each token.
3. Normalize to lowercase.

We do not remove punctuation of any type; in fact, the non-ASCII character substitution was made hoping that it reduces the variability of the punctuation used. We consider the Penn Treebank POS labels (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) as used in the NLTK package; we first obtain the POS labels with NLTK and then add two additional labels: $PUNCT to mark all punctuation and $OTHER to mark any word that the NLTK model failed to identify. After the preprocessing we obtain a parsed text like this:

[('momo', 'NNP'), (',', '$PUNCT'), ('also', 'RB'), ('known', 'VBN'), ('as', 'IN'), ('the', 'DT'), ('grey', 'NNP'), ('gentlemen', 'NNP'), ('or', 'CC'), ('the', 'DT'), ('men', 'NNPS'), ('in', 'IN'), ('grey', 'NNP'), (',', '$PUNCT')]

We define a directed graph G = (V, E) with tuples (word, pos) as nodes and weighted edges. We need to define the node set V and the edge set E, where each element of E is a tuple (n1, n2, w) with nodes n1, n2 ∈ V and edge weight w ∈ R. The construction is based on a co-occurrence graph, as explained in [7], but with a different edge weighting; we used the NetworkX package (https://networkx.org/) to construct the graphs.

Let P be the parsed text as a list of tuples, ℓ(P) the number of elements in the list, and P[i] the i-th element of the list. Our graph construction proceeds as follows. We define a set of POS labels called REDUCE_LABELS; for each P[i] = (word, pos) in P, we define

\[
M[i] =
\begin{cases}
(word,\ pos) & \text{if } pos \notin \mathrm{REDUCE\_LABELS} \\
(pos,\ pos) & \text{if } pos \in \mathrm{REDUCE\_LABELS}
\end{cases}
\]

where M is the list of tuples masked as explained. For each pair of tuples T1, T2 ∈ M, let f(T1, T2) be the number of times T1 is followed by T2 in M, and let N = ℓ(P) − 1 = ℓ(M) − 1; note that N is the total number of times a pair of consecutive tuples co-occurs in M. Now we can define the nodes and edges of our graph:

\[
V = \{\, T \mid T \in M \,\}
\]

Note that M is an ordered list, while V is just the set of all distinct tuples in M. We define an edge between any two nodes (tuples) that appear consecutively in the list M:

\[
E = \left\{ \left( T_1,\ T_2,\ \frac{f(T_1, T_2)}{N} \right) \;\middle|\; T_1, T_2 \in M \ \wedge\ f(T_1, T_2) > 0 \right\}
\]

With this construction, all tuples from the parsed text whose label is in REDUCE_LABELS are identified as a single node per label, so the graph structure changes and with it the information abstracted from the text. In our experiments we evaluated graphs generated with different REDUCE_LABELS sets. From now on, we call short graph the graph generated using the set of all possible POS labels as REDUCE_LABELS, and med graph the graph generated with the following REDUCE_LABELS set:

    REDUCE_LABELS = [
        'CD',                                     # Cardinal numbers
        'FW',                                     # Foreign words
        'JJ', 'JJR', 'JJS',                       # Adjectives
        'LS',                                     # List item marker
        'NN', 'NNS', 'NNP', 'NNPS',               # Nouns
        'RB', 'RBR', 'RBS',                       # Adverbs
        'SYM',                                    # Symbols
        'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',  # Verbs
        '$OTHER',                                 # Others
    ]

Continuing with our example, the med graph and the short graph are shown in Figures 1(a) and 1(b), respectively. To make the construction process clearer, Figure 1(c) shows the graph generated with an empty REDUCE_LABELS set. For simplicity, we do not draw the edge weights.

Figure 1: (a) Short graph, (b) Med graph, (c) Co-occurrence graph, equivalent to the graph generated with an empty REDUCE_LABELS set.

To feed our graph to a neural network, we need to transform the node names into vectors. We use a one-hot encoding of the POS label as the initial node features.
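The construction above fits in a few lines of code. The following minimal sketch uses NLTK's word_tokenize and pos_tag and a NetworkX DiGraph, as in our pipeline; the $PUNCT rule shown here and the omission of the $OTHER label are simplifications of our actual preprocessing.

    import string
    from collections import Counter

    import networkx as nx
    import nltk  # requires the punkt and averaged_perceptron_tagger resources
    from unidecode import unidecode

    def text_to_graph(text, reduce_labels=frozenset()):
        """Build the weighted co-occurrence graph of Section 4; an empty
        reduce_labels set yields the plain co-occurrence graph of Fig. 1(c)."""
        tokens = nltk.word_tokenize(unidecode(text))
        parsed = []
        for word, pos in nltk.pos_tag(tokens):
            if all(c in string.punctuation for c in word):
                pos = "$PUNCT"  # simplified punctuation rule
            parsed.append((word.lower(), pos))
        # Mask tuples whose POS label belongs to reduce_labels.
        masked = [(pos, pos) if pos in reduce_labels else (word, pos)
                  for word, pos in parsed]
        n = len(masked) - 1                        # total consecutive co-occurrences
        counts = Counter(zip(masked, masked[1:]))  # f(T1, T2) for consecutive pairs
        graph = nx.DiGraph()
        graph.add_nodes_from(masked)
        graph.add_weighted_edges_from(
            (t1, t2, f / n) for (t1, t2), f in counts.items())
        return graph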
5. Graph-based Siamese Network (GBSN)

To approach the authorship verification task, we use a Siamese network architecture. Our Graph-based Siamese Network is composed of two identical feature extraction components with shared weights, a reduction step, and a classification network. Each feature extraction component receives a text, transforms it into a graph, and returns a vector; the objective is to extract, from the graph representation of the text, relevant features that can identify the author's style. Figure 2 shows our proposed architecture.

Figure 2: GBSN base architecture.

We can distinguish three parts in the feature extraction component: first the graph representation, then the node embedding layers, and finally a global pooling. The graph representation is constructed as explained in Section 4. The node embedding layers obtain a relevant vector representation of each node in the graph; each layer is composed of a Local Extreme Convolution (LEConv) as defined in [16], followed by a batch normalization layer and a ReLU activation function. In total we use six of these layers. The LEConv layer was originally proposed to compute scores for selecting relevant clusters in a graph; its authors prove that it is more expressive than other layers, such as the Graph Convolutional Network layer defined by Kipf and Welling [17], and argue that it can consider both the local and the global importance of nodes.

For the pooling layer we chose the global attention layer originally proposed by Li et al. [18]. This layer takes the final output of the node embedding layers as its input, that is, a graph with a vector embedding in each node. To obtain the final vector, it computes a weighted sum of the node vectors, with coefficients obtained by attention over these same vectors:

\[
r = \sum_{n \in V} \mathrm{softmax}(h(x_n)) \cdot x_n
\]

where V is the set of all nodes in the graph, x_n is the embedding of node n, and h is a four-layer fully connected neural network with ReLU activations that outputs a single scalar.
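As a concrete illustration, the sketch below implements one feature extraction component, assuming PyTorch Geometric, which provides both LEConv and the global attention pooling layer (called AttentionalAggregation in recent releases); the hidden dimension is an illustrative placeholder, since the exact sizes are hyperparameters of our model.

    import torch
    from torch import nn
    from torch_geometric.nn import LEConv, GlobalAttention

    class FeatureExtractor(nn.Module):
        """Six LEConv + BatchNorm + ReLU node embedding layers followed by
        global attention pooling; returns one vector per input graph."""

        def __init__(self, in_dim, hidden_dim=64):
            super().__init__()
            dims = [in_dim] + [hidden_dim] * 6
            self.convs = nn.ModuleList(
                LEConv(d_in, d_out) for d_in, d_out in zip(dims, dims[1:]))
            self.norms = nn.ModuleList(
                nn.BatchNorm1d(hidden_dim) for _ in range(6))
            # h: four-layer MLP with ReLU that scores each node with a scalar.
            gate = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1))
            self.pool = GlobalAttention(gate_nn=gate)

        def forward(self, x, edge_index, edge_weight, batch):
            for conv, norm in zip(self.convs, self.norms):
                x = torch.relu(norm(conv(x, edge_index, edge_weight)))
            return self.pool(x, batch)  # one vector per graph in the batch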
For the reduction step, we simply compute the absolute value of the difference between the outputs of the two feature extraction components; this single vector is passed to a final classification network. The classification network is a five-layer fully connected network with ReLU activations and a final sigmoid function.

For training, we use the ADAM optimizer and binary cross-entropy as the loss function; in each epoch, all the pairs in the train split are supplied to the Siamese network in shuffled order. On the validation split we compute the loss and the average of the five scores considered for evaluation, and with these we select the best epoch of the model as our best model.

Our model was trained to return an output between 0 and 1. Given a threshold th and a margin m, we can transform the original output out_o into a new one with the formula

\[
l(out_o) =
\begin{cases}
\dfrac{out_o}{2(th - m)} & \text{if } out_o < th - m \\[6pt]
0.5 & \text{if } th - m \le out_o \le th + m \\[6pt]
\dfrac{out_o - 1}{2(1 - (th + m))} + 1 & \text{if } th + m < out_o
\end{cases}
\]

This linearly maps the interval [0, th − m) onto [0, 0.5), the interval [th − m, th + m] to 0.5, and the interval (th + m, 1] onto (0.5, 1]. Using this transformation, as a final tuning step, we performed a grid search over different thresholds and margins to optimize the average of the scores proposed in the PAN@CLEF shared task [1].
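In code, the transformation is a short piecewise function; the sketch below is a direct transcription of the formula above (the function name is ours), with th and m supplied by the grid search.

    def adjust_output(out, th, m):
        """Piecewise-linear rescaling of a raw score in [0, 1] so that the
        band [th - m, th + m] maps to the PAN non-answer value 0.5."""
        if out < th - m:
            return out / (2 * (th - m))
        if out <= th + m:
            return 0.5
        return (out - 1) / (2 * (1 - (th + m))) + 1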
5.1. Ensemble architecture

We noticed that using more than one graph representation and adding traditional stylistic features can improve the performance of our model (see Table 1). We therefore upgraded our base architecture by placing different feature extraction components side by side in a new Graph-based Siamese Network ensemble architecture. Each subnetwork of the Siamese architecture is now formed by several independent feature extraction components, each one producing a vector from the text, as shown in Figure 3. The vectors obtained by the components are concatenated, and the concatenated vector is used in the reduction step.

Figure 3: Graph-based Siamese Network Ensemble architecture.

Each graph-based feature extraction component is the same component used in the base architecture, so these components will be identified as short graph or med graph components. In addition, we may include a vector of stylistic features of the texts as another independent component. The stylistic features were selected from the ones used by Weerasinghe and Greenstadt [19]. For simplicity we use only the following (a sketch of their computation is shown after this list):

• Frequency of function words (179 function words in the NLTK list)
• Average number of characters per word
• Vocabulary richness
• Frequency of words of n characters (from 1 to 10 characters)
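A minimal sketch of these features follows, assuming NLTK's English stopword list as the function-word inventory and the type-token ratio as the vocabulary richness measure; both are assumptions that may differ in detail from our implementation.

    from collections import Counter

    from nltk.corpus import stopwords  # requires nltk.download("stopwords")

    def stylistic_features(tokens):
        """Return the stylistic feature vector for a tokenized text."""
        n = len(tokens)
        counts = Counter(tokens)
        length_counts = Counter(len(t) for t in tokens)
        # Frequency of each function word (assumed: NLTK stopword list).
        feats = [counts[w] / n for w in stopwords.words("english")]
        # Average number of characters per word.
        feats.append(sum(len(t) for t in tokens) / n)
        # Vocabulary richness as the type-token ratio (an assumption).
        feats.append(len(counts) / n)
        # Frequency of words of 1..10 characters.
        feats.extend(length_counts[k] / n for k in range(1, 11))
        return feats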
We evaluated two training strategies: 1) training the whole network together, and 2) transferring the weights from base instances of the corresponding GBSN. The latter means first training one GBSN with only the short graph component and another with only the med graph component, and then defining the ensemble GBSN by loading the trained feature extraction weights into the corresponding components. Transferring the weights achieved better results, so we define our model in this way; this behavior is probably because, under our experimental framework, training both components together did not let each component settle at its best individual weight configuration. In our experiments, we also noticed that an ensemble of two instances of the same graph representation gives better results than a single instance.

For our final submissions, we use an ensemble of five feature extraction components in total: two instances of the short graph feature extraction, two instances of the med graph feature extraction, and the stylistic features. Figure 4 shows our submitted architecture.

Figure 4: Final model submission: ensemble with two short graph components, two med graph components, and stylistic features.
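The sketch below illustrates how the pieces fit together: both sides of the pair share the same feature extraction components, the per-component vectors are concatenated, reduced with the absolute difference, and classified by the five-layer network. It assumes each component is callable on its own batched input (a graph batch for the GNN components, a precomputed vector for the stylistic one); names and dimensions are illustrative.

    import torch
    from torch import nn

    class GBSNEnsemble(nn.Module):
        """Siamese ensemble: shared extractors -> concat -> |u - v| -> classifier."""

        def __init__(self, extractors, feat_dim, hidden_dim=128):
            super().__init__()
            self.extractors = nn.ModuleList(extractors)  # shared by both sides
            layers, d = [], feat_dim
            for _ in range(4):  # five-layer classifier: four hidden + output
                layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
                d = hidden_dim
            layers += [nn.Linear(d, 1), nn.Sigmoid()]
            self.classifier = nn.Sequential(*layers)

        def encode(self, inputs):
            # One input per component; each extractor returns one vector
            # per text, and the vectors are concatenated.
            return torch.cat(
                [f(x) for f, x in zip(self.extractors, inputs)], dim=-1)

        def forward(self, left, right):
            u, v = self.encode(left), self.encode(right)
            return self.classifier(torch.abs(u - v)).squeeze(-1)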
6. Results

Table 1 shows the comparative performance of the different configurations. Each row reports the average of the five proposed scores for one model, trained on the small and on the large dataset respectively. The first row corresponds to a base architecture model using only the short graph component for feature extraction; the second row to a base architecture model using only the med graph component. The third row shows the scores of an ensemble architecture model using both the short graph and med graph components but without stylistic features. The fourth row shows the scores of an ensemble architecture model using short graph, med graph, and stylistic features components. Finally, the fifth row shows the scores of our submitted model, an ensemble architecture with two short graph components, two med graph components, and stylistic features.

Table 1
Performance of the Graph-based Siamese Network (GBSN) with single and ensemble feature extraction components on our test split

                                                  Small dataset       Large dataset
                                                  (47,336 problems)   (247,992 problems)
    Short graph component                         87.81               90.63
    Med graph component                           88.31               91.50
    Short + Med                                   89.21               92.38
    Short + Med + Features                        89.80               NA
    Short(x2) + Med(x2) + Features (Submitted)    90.03               92.96

Finally, our two submissions were scored on the test dataset of the PAN 2021 Authorship Verification task [1]. We obtained the second best score with the model trained on the large dataset and the third best score with the model trained on the small dataset. Table 2 shows the average of the five proposed metrics. We note that the scores on the PAN test dataset were higher than the ones obtained on our own test split. This is probably because we made our splits author disjoint, so our experiments selected models capable of solving the task for authors never seen in training.

Table 2
Performance of the Graph-based Siamese Network (GBSN) ensemble on the test dataset of PAN 2021

                                                  Small dataset       Large dataset
    Short(x2) + Med(x2) + Features (Submitted)    90.70               93.59

7. Conclusions

In this paper, we presented our approach for the authorship verification task at PAN 2021 [1], which is based on a novel Graph-based Siamese Network. In our experiments, we observed good performance and great potential in using a graph representation of the text, graph convolutional neural networks, and a Siamese architecture. We note that the convolutional layers used in our model are relatively simple; we experimented with varying the architecture and hyperparameters to improve the model, but there are many different graph convolution and pooling layer configurations to try in future work.

We did not use any up-sampling technique over the given dataset. In fact, because of our training framework, for each dataset we trained our model using only 90% of the available data; that is, the small model was trained on 47,336 problems and the large model on 247,992 problems. This is relevant because our architecture showed better performance with more training pairs.

Acknowledgments

This research was partially funded by a CONACyT PNPC scholarship (CVU 1004062) and DGAPA-UNAM PAPIIT grant number TA100520. The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.

References

[1] M. Kestemont, I. Markov, E. Stamatatos, E. Manjavacas, J. Bevendorff, M. Potthast, B. Stein, Overview of the authorship verification task at PAN 2021, in: [14], 2021.
[2] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "Siamese" time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence 7 (1993) 669–688. doi:10.1142/S0218001493000339.
[3] S. Mekala, V. V. Bulusu, A survey on authorship attribution approaches, International Journal of Computational Engineering Research (IJCER) 8 (2018).
[4] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology 60 (2009) 538–556. doi:10.1002/asi.21001.
[5] E. Stamatatos, W. Daelemans, B. Verhoeven, M. Potthast, B. Stein, P. Juola, M. A. Sanchez-Perez, A. Barrón-Cedeño, Overview of the author identification task at PAN 2014, in: Proceedings of the CLEF PAN Conference, 2014.
[6] E. Stamatatos, W. Daelemans, B. Verhoeven, P. Juola, A. López-López, M. Potthast, B. Stein, Overview of the author identification task at PAN 2015, in: Proceedings of the CLEF PAN Conference, 2015.
[7] S. S. Sonawane, P. A. Kulkarni, Graph based representation and analysis of text document: A survey of techniques, International Journal of Computer Applications 96 (2014) 1–8. doi:10.5120/16899-6972.
[8] E. Castillo, O. Cervantes, D. Vilariño, Text analysis using different graph-based representations, Computación y Sistemas 21 (2017) 581–599. doi:10.13053/cys-21-4-2551.
[9] D. Pinto, H. Gómez-Adorno, D. Vilariño, V. Singh, A graph-based multi-level linguistic representation for document understanding, Pattern Recognition Letters 41 (2014) 93–102. doi:10.1016/j.patrec.2013.12.004.
[10] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, ICML Deep Learning Workshop 2 (2015).
[11] M. Kestemont, E. Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the cross-domain authorship verification task at PAN 2020, Notebook for PAN at CLEF 2020 (2020).
[12] B. Boenninghoff, J. Rupp, R. M. Nickel, D. Kolossa, Deep Bayes factor scoring for authorship verification, Notebook for PAN at CLEF 2020 (2020).
[13] E. Araujo-Pino, H. Gómez-Adorno, G. Fuentes-Pineda, Siamese network applied to authorship verification, Notebook for PAN at CLEF 2020 (2020).
[14] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship verification, profiling hate speech spreaders on Twitter, and style change detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.
[15] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA integrated research architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[16] E. Ranjan, S. Sanyal, P. P. Talukdar, ASAP: Adaptive structure aware pooling for learning hierarchical graph representations, in: Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. arXiv:1911.07979.
[17] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: International Conference on Learning Representations (ICLR), 2017. arXiv:1609.02907.
[18] Y. Li, D. Tarlow, M. Brockschmidt, R. Zemel, Gated graph sequence neural networks, in: Proceedings of ICLR, 2016. arXiv:1511.05493.
[19] J. Weerasinghe, R. Greenstadt, Feature vector difference based neural network and logistic regression models for authorship verification, Notebook for PAN at CLEF 2020 (2020).