Author classification as pre-training for pairwise authorship verification
Notebook for PAN at CLEF 2021

Romain Futrzynski
Peltarion, Holländargatan 17, 111 60 Stockholm, Sweden

Abstract
In this paper, we propose to use a standard BERT model for the PAN 2021 Authorship Verification task, where two texts must be determined to have either the same or different authors. The model is chiefly trained to classify short sequences of text as belonging to one of three thousand authors selected from the large training dataset. Additional tasks are also used simultaneously during training in order to capitalize on the information available, namely a masked language model task, a fandom classification task, and an author-fandom separation task. To perform Authorship Verification, an embedding is extracted from the trained BERT model. In order to reduce the computational cost, only a short sample of text is processed by BERT, but the same text is sampled a hundred times at random locations, and the embeddings from each sample are reduced to a single representation using the median. The representations from two texts are compared by cosine similarity, which is rescaled empirically so that most of the ambiguous pairs lie on the 0.5 threshold. Evaluated on authors and topics absent from the training dataset, this model achieved F1=0.832 and AUC=0.798.

Keywords
BERT, similarity

1. Introduction

Authorship verification [1] is one of the shared tasks at PAN 2021 [2]. The purpose of this task is to determine whether or not two texts from a given pair were written by the same author, without any prior knowledge about the authors or the topics of the texts. The texts used for authorship verification are stories extracted from www.fanfiction.net, each set within a fandom (i.e., a popular fictional universe such as Harry Potter, Twilight, or True Blood) and written by fans of that fandom. The particular fandom of a text is known during training, but it is not available to models during evaluation on the test set. Moreover, both the authors and fandoms included in the training set are different from those included in the test set.

The method proposed here is to train a standard BERT [3] model using an author classification task. Since the authors in the unseen test set are different from the authors available for training, the model cannot be used to directly identify specific authors. Instead, a representation of an input text, in the form of an embedding vector, is extracted from the model before its classification layer. By using a large number of authors during training, the model learns to produce representations of text that can be compared via cosine similarity to determine whether they share the same author. In addition, the model is trained simultaneously on a masked language model task, a fandom classification task, and an author-fandom separation task in order to induce desirable properties in the model.
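As a minimal sketch of this idea (using the Hugging Face transformers API, and ignoring for now the fine-tuning and the text-sampling strategy described in the following sections), two texts could be embedded and compared as follows; the embed helper is introduced here purely for illustration and is not part of the method's actual code.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")
    model.eval()

    def embed(text: str) -> torch.Tensor:
        # The pooler output is the CLS embedding passed through a linear layer
        # and a tanh activation (768 values for bert-base-cased).
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return model(**inputs).pooler_output.squeeze(0)

    # A cosine similarity close to 1 suggests the same author once the model
    # has been fine-tuned as described in the following sections.
    score = torch.nn.functional.cosine_similarity(
        embed("First text of the pair."), embed("Second text of the pair."), dim=0
    )
    print(float(score))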
2. Dataset preprocessing

The dataset used for training is the large training set from the CLEF 2020 authorship verification task [4]. This dataset is composed of text pairs labeled with whether they were written by the same or by different authors. An anonymous author identifier and the fandom are also provided for each text of the pair. In order to train the model using a classification task, the dataset is first reorganized as a list of 493,296 unique texts labeled with their author and fandom. A single author may have written between 1 and 30 texts, within 1 to 6 fandoms.

In order to provide as much diversity as possible while keeping the training time reasonable, a training split is created by gathering every text from the authors who have written within exactly 6 fandoms. This results in a training split containing 72,471 distinct texts, from 3,353 distinct authors, and covering 1,471 distinct fandoms.

A validation split is also created from the training texts. Only authors who have written within 5 fandoms or fewer may be included in the validation split, ensuring that the authors from the training and validation splits are distinct. The validation split is structured as text pairs in order to be evaluated in a fashion similar to the end task. To constitute the validation split, 2,500 text pairs are sampled at random so that the two texts were written by different authors, and 2,500 text pairs are sampled at random so that both texts were written by the same author, while also ensuring that the two texts of a given pair are distinct. The validation split therefore contains 5,000 text pairs, for a total of 9,816 distinct texts, within 1,253 distinct fandoms, from 7,216 distinct authors who do not overlap with the authors of the training split.

3. Model description

The model relies on the BERT [3] architecture and English pretraining as provided by the bert-base-cased model (https://huggingface.co/bert-base-cased) from the transformers Python module.

3.1. Model input

The model is given a tokenized text as input. The texts contained in the dataset are often full stories, commonly reaching over 5,000 tokens. Since the computation time required by transformer models grows quadratically with the length of the input sequence, the model is only given 28 consecutive tokens, picked at a random location in the text that is chosen independently for every text and every epoch. Besides decreasing the computation time, this is expected to force the model to focus on brief writing patterns. The size of 28 aims to promote focusing on such short patterns, while still providing enough tokens to let the model leverage the contextual understanding of self-attention. The sequence of tokens is also prepended with a CLS token, whose embedding is intended to be used for classification tasks, and appended with a SEP token, which brings the total model input size to 30 tokens. Although the model is always given texts one at a time, the SEP token is added as a way for the model to store information without affecting any of the classification or language modeling tasks used during training, and to resemble the pretraining setup more closely.
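A possible implementation of this input sampling is sketched below. The sample_window helper is a hypothetical name introduced for illustration, and it assumes the text is at least 28 tokens long, which holds for the stories in this dataset.

    import random
    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def sample_window(text: str, window: int = 28) -> torch.Tensor:
        # Tokenize the full text without special tokens, then keep a window of
        # 28 consecutive tokens at a location drawn at random for every call.
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        start = random.randint(0, max(len(ids) - window, 0))
        window_ids = ids[start:start + window]
        # Prepend CLS and append SEP, for a total input length of 30 tokens.
        return torch.tensor(
            [tokenizer.cls_token_id] + window_ids + [tokenizer.sep_token_id]
        )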
3.2. Model output

For the classification tasks during training, and for the authorship verification task during validation and test, the pooler vector is used. This vector is the result of passing the embedding of the CLS token through a linear layer followed by a hyperbolic tangent activation function. The first half of this pooler vector, i.e. its first 384 values, serves as an embedding of the author, which is used for author classification during training and for similarity evaluation during validation and test. The second half of the pooler vector, i.e. its last 384 values, is used only for fandom classification during training.

4. Training procedure

The model is trained using the four tasks described below. The optimizer is AdamW [5] with a learning rate of 2e-5 and a batch size of 250. The linear layers used for the author and fandom classification tasks use a different, higher learning rate of 1e-3. The model is trained for 20 epochs.

4.1. Language model task

During training, 10% of the input tokens are randomly replaced by a mask token. This differs from the 15% rate used in the original pretraining in order to avoid masking too much of the relatively short token sequences. The language model head of BERT is run on the sequence of token embeddings, and its ability to recover the original token for every masked token is tuned using a crossentropy loss. The purpose of the language model task is to prevent catastrophic forgetting of the original pre-trained weights, and to promote learning of author-specific words and idioms.

4.2. Author classification

The model is simultaneously trained to identify the author of every text in the training split as a classification task. For this purpose, the model uses the first half of the pooler vector, passed to a linear layer whose output size corresponds to the number of unique authors in the training split, that is, 3,353. No activation function and no bias term are used in this linear layer, in order to promote the learning of 384-dimensional embedding vectors that are directly suitable for comparison using cosine similarity. The performance of the author classification task is tuned using the crossentropy loss function.

4.3. Fandom classification

Similarly, the model is trained to identify the fandom of every text in the training split as another classification task. For this task, the second half of the pooler vector is used. It is passed to another linear layer whose output size corresponds to the number of unique fandoms in the training split, that is, 1,471. No activation function is used in this layer, although the bias term is enabled. The purpose of this task is to promote fandom awareness, and it may also contribute to improving the language model in conjunction with the author classification task. The performance of the fandom classification task is tuned using the crossentropy loss function.

4.4. Author-fandom separation

Since the end task is to verify authorship independently from the fandom, it is undesirable for the model to use its predictions about the fandom in order to discriminate between authors. To counter this phenomenon, a simple network is trained to classify fandoms from the first half of the pooler vector, which is normally intended to embed authors only. This network is made of a linear layer with bias and an output size equal to that of the full pooler vector, i.e. 768, followed by the SELU [6] activation function, a 10% dropout layer, and a last linear layer projecting to the number of fandoms without bias. This network is trained using a crossentropy loss in parallel with, but independently from, the main model: it uses the same training batches but has its own gradient updates. Then, the crossentropy loss from this network is recalculated as part of the main model training. This loss is scaled by a coefficient of 0.1 and subtracted from the sum of the losses from the three other tasks, which yields the total loss used for training. The reason for scaling down the loss of this task is that, since every author in the training split has written within only 6 fandoms, good fandom classification could reasonably be expected even if the authors were embedded from stylistic considerations alone.
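The sketch below outlines how the four tasks could be combined into the total loss, together with the two learning rates. The names author_head, fandom_head and adversary, as well as the adversary's learning rate, are assumptions made for illustration, and the masked-language-model loss is treated as given in order to keep the sketch short; this is not the exact training code.

    import torch
    from torch import nn
    from transformers import BertModel

    NUM_AUTHORS, NUM_FANDOMS, HALF = 3353, 1471, 384

    bert = BertModel.from_pretrained("bert-base-cased")
    author_head = nn.Linear(HALF, NUM_AUTHORS, bias=False)  # no bias, no activation
    fandom_head = nn.Linear(HALF, NUM_FANDOMS, bias=True)   # bias, no activation

    # Network trying to recover the fandom from the author half of the pooler.
    adversary = nn.Sequential(
        nn.Linear(HALF, 768),
        nn.SELU(),
        nn.Dropout(p=0.1),
        nn.Linear(768, NUM_FANDOMS, bias=False),
    )

    # BERT uses a learning rate of 2e-5; the classification heads use 1e-3.
    optimizer = torch.optim.AdamW([
        {"params": bert.parameters(), "lr": 2e-5},
        {"params": list(author_head.parameters()) + list(fandom_head.parameters()),
         "lr": 1e-3},
    ])
    adv_optimizer = torch.optim.AdamW(adversary.parameters(), lr=1e-3)  # assumed rate
    ce = nn.CrossEntropyLoss()

    def training_step(input_ids, attention_mask, author_ids, fandom_ids, lm_loss):
        # lm_loss: masked-language-model loss computed on the same partly masked
        # batch by BERT's language model head; its computation is omitted here.
        pooler = bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        author_half, fandom_half = pooler[:, :HALF], pooler[:, HALF:]

        # Update the adversary on its own, with the author embedding detached so
        # that only the adversary's weights change in this step.
        adv_optimizer.zero_grad()
        ce(adversary(author_half.detach()), fandom_ids).backward()
        adv_optimizer.step()

        # Recompute the adversary loss, scale it by 0.1 and subtract it from the
        # sum of the other losses, pushing the author embedding to be less
        # informative about the fandom.
        total_loss = (lm_loss
                      + ce(author_head(author_half), author_ids)
                      + ce(fandom_head(fandom_half), fandom_ids)
                      - 0.1 * ce(adversary(author_half), fandom_ids))
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        return total_loss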
5. Validation procedure

The validation task is to receive two texts and return whether or not they were written by the same author. As this differs from all of the training tasks, a different procedure is used. A single text may easily run to thousands of tokens, whereas the model is trained using a sequence length of 30 tokens. To proceed, 100 short sequences of 28 tokens are sampled at independent random locations within a text, and the CLS and SEP tokens are added as in the training format. The model then runs a forward pass, and the first half of the pooler output vector is stored for each of these 100 sequences as an embedding of the sequence. These embeddings are then reduced to a single 384-component vector using the median of each component. The same steps are repeated for the second text of a pair, yielding another median-reduced vector.

Finally, the two reduced vectors are compared using cosine similarity. Since the task rewards models that answer 0.5 for an uncertain result, it is desirable to rescale the cosine similarity scores so that similar and dissimilar authors lie on either side of 0.5. For this purpose, the ROC curve on the validation split is plotted every epoch, and the threshold corresponding to the halfway step is monitored, giving an approximation of the middle of the curve. After 20 epochs, the threshold is 0.6592, which is rounded to 0.65. Therefore, during the test run, the values of cosine similarity between 0 and 0.625 are linearly rescaled to between 0 and 0.5; the values between 0.675 and 1 are rescaled to between 0.5 and 1; the values between 0.625 and 0.675 are collapsed to 0.5. The rescaled cosine similarity is reported as the answer to the test task.
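Under the same assumptions as the earlier sketches (reusing the bert model and the hypothetical sample_window helper), the verification score for a pair of texts could be computed as follows; clipping negative cosine similarities to 0 is an added assumption, since such values are not discussed above.

    import torch

    def text_embedding(text: str, n_samples: int = 100) -> torch.Tensor:
        # Embed 100 randomly located 30-token sequences and reduce them to a
        # single 384-component vector with the component-wise median.
        windows = torch.stack([sample_window(text) for _ in range(n_samples)])
        with torch.no_grad():
            pooler = bert(input_ids=windows).pooler_output  # shape (100, 768)
        return pooler[:, :384].median(dim=0).values         # shape (384,)

    def rescale(similarity: float) -> float:
        # Map the cosine similarity so that the ambiguous band around the
        # empirical threshold of 0.65 collapses onto 0.5.
        similarity = min(max(similarity, 0.0), 1.0)  # assumption: clip to [0, 1]
        if similarity <= 0.625:
            return 0.5 * similarity / 0.625
        if similarity >= 0.675:
            return 0.5 + 0.5 * (similarity - 0.675) / (1.0 - 0.675)
        return 0.5

    def verify(text_a: str, text_b: str) -> float:
        # Returns a score in [0, 1]; 0.5 means "undecided".
        similarity = torch.nn.functional.cosine_similarity(
            text_embedding(text_a), text_embedding(text_b), dim=0
        )
        return rescale(float(similarity))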
6. Results

The evolution of the model performance over epochs is reported here. In addition to the model described above, which uses an embedding size of 384 (half the embedding size of BERT), progress is also reported for another model using only 10 components in its embedding. In this model, the first 10 values of the pooler vector are used as the author embedding, the next 10 values are used for fandom classification, and the remaining 748 values are simply ignored. Although the smaller model appears to perform relatively well, as shown in Table 1, only the model using 384 components was submitted since it has the best overall performance. Furthermore, the plots show progress over 100 epochs, but the submitted model was only trained for the first 20 epochs. The purpose of submitting only the 384-component model after 20 epochs is to preserve as much information from BERT's original pretraining as possible, in order to help generalization to unknown authors and fandoms.

Table 1
Comparison of the 10-component and 384-component embedding sizes, measured after 20 epochs on the validation set.

                               AUC    c@1    F1     F0.5u  Overall
    10-component embedding     0.830  0.760  0.771  0.737  0.781
    384-component embedding    0.864  0.802  0.817  0.757  0.810

In order to speed up training, the validation metrics are calculated from only 600 text pairs sampled at random from the 5,000 text pairs contained in the validation split. Finally, the plots are smoothed using a Gaussian kernel with a standard deviation of 4 epochs, with the original curves shown as a lighter shade. Figure 1 shows the performance of both the 10-component and the 384-component embedding models. Table 2 shows the final results of the model, evaluated on the unseen test set using the TIRA platform [7].

Table 2
Official final results of the 384-component embedding model, with the overall score calculated as the average of all metrics.

    AUC    c@1    F1     F0.5u  Brier  Overall
    0.798  0.663  0.832  0.668  0.796  0.752

Figure 1: Validation metrics calculated on a random subset of the validation split after every epoch of training.
Figure 2: Losses of the first three training tasks as a function of the epoch number.
Figure 3: Accuracy of the first three training tasks as a function of the epoch number.
Figure 4: Left: loss of the network classifying fandoms from the author embedding. Center: accuracy of the network classifying fandoms from the author embedding. Right: ratio of the accuracy obtained by the main fandom classification task over the same accuracy obtained by the network working from the author embedding.

7. Concluding remarks

It is interesting to note that the metrics measured on unknown authors reach similar values whether an embedding size of 384 or 10 is used. However, the smaller embedding size has inherently less capacity to store information, so its generalization performance on significantly different data can be questioned. Further studies, notably regarding other embedding sizes, the length of the input sequence, and the number of short sequences sampled from a large text, would provide interesting information about the performance of models trained for classification on similarity-like tasks.

While the model only processes short sequences of text, this processing must be repeated several times in order to obtain reliable results. As a result, using the model as it is to regularly process large amounts of text pairs may be problematic, especially on hardware that is not specifically designed for tensor operations.

References

[1] M. Kestemont, I. Markov, E. Stamatatos, E. Manjavacas, J. Bevendorff, M. Potthast, B. Stein, Overview of the Authorship Verification Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[2] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. arXiv:1810.04805.
[4] M. Kestemont, E. Manjavacas, I. Markov, J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2020, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum (2020) 22–25. URL: https://pan.webis.de.
[5] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2017. arXiv:1711.05101.
[6] G. Klambauer, T. Unterthiner, A. Mayr, S. Hochreiter, Self-normalizing neural networks, 2017. arXiv:1706.02515.
[7] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C.
Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.