=Paper=
{{Paper
|id=Vol-2936/paper-186
|storemode=property
|title=Encoding Text Information By Pre-trained Model For Authorship Verification
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-186.pdf
|volume=Vol-2936
|authors=Zeyang Peng,Leilei Kong,Zhijie Zhang,Zhongyuan Han,Xu Sun
|dblpUrl=https://dblp.org/rec/conf/clef/PengKZHS21
}}
==Encoding Text Information By Pre-trained Model For Authorship Verification==
Encoding Text Information By Pre-trained Model For Authorship Verification
Notebook for PAN at CLEF 2021

Zeyang Peng1, Leilei Kong1*, Zhijie Zhang1, Zhongyuan Han1, Xu Sun2
1 Foshan University, Foshan, China
2 Heilongjiang Institute of Technology, Harbin, China
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
EMAIL: pengzeyang008@163.com (A. 1); kongleilei@fosu.edu.cn (A. 2) (*corresponding author); zhangzhijie5454@gmail.com (A. 3); hanzhongyuan@fosu.edu.cn (A. 4)
ORCID: 0000-0002-8605-4426 (A. 1); 0000-0002-4636-3507 (A. 2); 0000-0002-4854-0618 (A. 3); 0000-0001-8960-9872 (A. 4)

Abstract
Authorship verification is the task of deciding whether two texts have been written by the same author, based on comparing the texts' writing styles. We present a classification method for authorship verification based on encoding text information with a pre-trained model. The proposed model achieved the highest c@1 and F1-score on the small dataset of the PAN authorship verification task.

Keywords
Pre-trained model, Text information, Classification, Authorship verification

1. Introduction

Phenomena such as telecommunication fraud, email fraud, and terrorist attacks exist in today's society, so verifying the authorship of texts of unknown origin is important. Authorship verification is an active research area of computational linguistics that addresses a fundamental question of stylometry, namely whether or not two texts were written by the same author [1]. Accordingly, it has become one of the staple shared tasks at PAN. The work presented in this paper was developed as a solution to the authorship verification task of the PAN @ CLEF 2021 competition.

At PAN 2021, authorship verification is posed as an open-set verification task: the test dataset contains verification cases from authors and topics that do not appear in the training dataset [2], so a writing-style model built for the authors or topics of the training dataset does not carry over to the open-set setting. Accordingly, our idea is to encode the text information to obtain text features and to decide whether two texts were written by the same author by comparing the similarity of these features.

In recent years, more and more pre-trained models, typified by BERT [3], have performed well in natural language processing. In particular, BERT, which stands for Bidirectional Encoder Representations from Transformers [4], has brought considerable improvements in encoding text information. However, we also noticed that BERT cannot encode long texts efficiently. To encode long texts, our method splits each long text into short texts that BERT can encode, then obtains local similarities by pairing the corresponding short texts of the two long texts. Finally, the overall similarity of the two long texts is obtained by integrating these local similarities.

2. Datasets

The authorship verification datasets consist of fanfictions drawn from fanfiction.net, i.e., writing in which fans use media narratives and pop-cultural icons as inspiration for creating their own texts. Two training datasets, a large one and a small one, are provided, each consisting of pairs of (snippets from) two different fanfics. They include 275,565 and 52,601 pairs of texts, respectively. Table 1 shows the results of data analysis over both datasets.
Table 1
Details of the authorship verification datasets

Dataset   Samples   Positive Samples   Max Characters   Min Characters   Mean Characters
Small     52601     27834              296887           20670            21424.93
Large     275565    147778             943947           20355            21426.08

The Samples column shows the number of text pairs, and the last three columns show character statistics of the texts.

3. Method

3.1. Network Architecture

Given two texts, denoted as text1 and text2, the task of authorship verification is to decide whether they were written by the same author. Suppose text1 = {t11, t12, ..., t1N}, where t11 is the first fragment of text1 and t1N is the Nth fragment of text1, and text2 = {t21, t22, ..., t2N}, where t21 is the first fragment of text2 and t2N is the Nth fragment of text2. N is set to 30, so a text pair is split into 30 short text pairs. Figure 1 shows the network architecture.

Figure 1: Architecture diagram for our model. The pipeline is: text1 and text2 are split into fragments t11...t1N and t21...t2N; corresponding fragments form text pairs t11t21, ..., t1Nt2N; each pair is tokenized and padded and encoded by BERT into a feature f_t1it2i; the features are concatenated into a text representation, which passes through a GlobalAveragePooling1D layer, two fully connected layers, and a softmax output.

As can be observed, a text pair is split into N short text pairs, where t1it2i is the short text pair consisting of the ith fragments of text1 and text2. By using BERT to encode these short text pairs, we obtain more effective text features, where f_t1it2i denotes the encoded feature of the ith short text pair. The text representation is then obtained by concatenating these text features. Finally, we feed the text representation into a fully connected neural network to build a binary classification model, which decides whether the two texts have been written by the same author.

3.2. Text Preprocessing

Text preprocessing corresponds to the first three steps in Figure 1. We use punctuation as a separator to extract the first 30 fragments of each text sample. Each training sample's text1 and text2 then both have 30 fragments, and the fragments share the sample's label, so a text pair yields 30 sub-training samples. We combine the corresponding fragments of text1 and text2, and each combined pair is tokenized and sequence-padded into a vector of maximum length 256. Figure 2 shows an example of this preprocessing.
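As a rough illustration of this preprocessing step, the sketch below splits each text on sentence-ending punctuation, keeps the first 30 fragments, and tokenizes corresponding fragment pairs into fixed-length sequences of 256 token IDs. The choice of the Hugging Face BertTokenizer, the bert-base-cased vocabulary, and the exact punctuation rule are assumptions made for illustration; the paper states only that punctuation is used as the separator and that the fragment pairs are tokenized and padded for BERT.

```python
import re
from transformers import BertTokenizer  # assumption: any tokenizer with a matching BERT vocabulary works

MAX_LEN = 256      # maximum sequence length reported in Section 3.2
N_FRAGMENTS = 30   # number of fragments kept per text

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def split_fragments(text, n=N_FRAGMENTS):
    """Split a long text on punctuation and keep its first n fragments."""
    fragments = [f.strip() for f in re.split(r"[.!?]", text) if f.strip()]
    return fragments[:n]

def encode_pair(text1, text2):
    """Pair corresponding fragments of text1 and text2 and tokenize each pair
    as one [CLS] fragment1 [SEP] fragment2 [SEP] sequence padded to MAX_LEN."""
    frags1, frags2 = split_fragments(text1), split_fragments(text2)
    encoded = []
    for f1, f2 in zip(frags1, frags2):
        enc = tokenizer(f1, f2, max_length=MAX_LEN,
                        padding="max_length", truncation=True)
        encoded.append(enc["input_ids"])
    return encoded  # up to N_FRAGMENTS sequences, each of length MAX_LEN
```

Under these assumptions, each training pair becomes a (30, 256) array of token IDs, matching the per-pair input shape described in Section 4.1.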
Figure 2: An example of how we preprocess a text pair: two text fragments (one from text1 and one from text2) are combined into a single tokenized and padded sequence of token IDs. The [CLS] token is a special symbol added in front of every input example, and [SEP] is a special separator token used to separate the two fragments in this example.

4. Experiments and Results

4.1. Experimental setting

In this work, BERT-Base (L=12, H=768, A=12, total parameters=110M) is chosen as the pre-trained model, and we use Keras to construct the BERT and fully connected network classification model. A text pair is split into 30 short texts and tokenized into a vector of shape (30, M), where M is the maximum length of the short texts; there are 52,601 such vectors in the small dataset. In the fine-tuning phase of the pre-trained model, we set batch_size = 30, use sparse categorical cross-entropy as the loss function, and optimize with Adam at a 2e-5 learning rate. Using BERT to encode the text information, we obtain the feature vectors and reshape them to (52601, 30, 768). After one-dimensional global average pooling, the shape becomes (52601, 768). The first fully connected layer has an output hidden size of 16 with ReLU activation. The second fully connected layer has an output hidden size of 2 with softmax activation. The final fully connected network is trained for 400 epochs with the Adam optimizer.
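The classification part of this setting can be written as a small Keras model. The following is a minimal sketch that assumes the BERT fragment features have already been computed and saved to disk; the file names, the accuracy metric, and the validation split are placeholders of ours, while the pooling layer, layer sizes, activations, loss, optimizer, batch size, and epoch count follow the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Pre-computed BERT features of shape (num_pairs, 30, 768) and binary labels
# (0 = different author, 1 = same author); the file names are hypothetical.
bert_features = np.load("bert_features_small.npy")
labels = np.load("labels_small.npy")

model = keras.Sequential([
    keras.Input(shape=(30, 768)),
    layers.GlobalAveragePooling1D(),        # (30, 768) -> (768,)
    layers.Dense(16, activation="relu"),    # first FC layer, hidden size 16
    layers.Dense(2, activation="softmax"),  # second FC layer: same / different author
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The batch size mirrors the fine-tuning setting; the validation split is illustrative.
model.fit(bert_features, labels, batch_size=30, epochs=400, validation_split=0.3)
```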
4.2. Results

To evaluate the proposed model, two splitting strategies are adopted on the given small training dataset. The first uses 36,821 text pairs for training and 15,780 text pairs for validation; the second uses 45,000 pairs for training and 7,601 for validation. The experimental results on the validation dataset are denoted as Val-1 and Val-2, respectively.

Table 2
Validation results on the small dataset

Datasets   AUC     c@1     f_05_u   F1      Brier   Overall
Val-1      0.920   0.921   0.921    0.926   0.921   0.922
Val-2      0.944   0.943   0.958    0.945   0.943   0.946

It can be observed that the overall score increases as the number of training samples increases. Table 3 shows the final evaluation results on the small dataset of the PAN 2021 authorship verification task, evaluated on the TIRA platform [5]. Our model is denoted as peng21.

Table 3
Final results on the test dataset

Team                  AUC      c@1      f_05_u   F1       Brier    Overall
weerasinghe21         0.9666   0.9103   0.9270   0.9071   0.9290   0.9280
peng21                0.9172   0.9172   0.9200   0.9167   0.9172   0.9177
embarcaderoruiz21     0.9470   0.8982   0.8785   0.9040   0.9072   0.9070
menta21               0.9385   0.8662   0.8787   0.8620   0.8762   0.8843
rabinovits21          0.8129   0.8129   0.8186   0.8094   0.8129   0.8133
ikae21                0.9041   0.7586   0.7233   0.8145   0.8247   0.8050
unmasking21           0.8298   0.7707   0.7466   0.7803   0.7904   0.7836
naive21               0.7956   0.7320   0.6998   0.7856   0.7867   0.7600
compressor21          0.7896   0.7282   0.7027   0.7609   0.8094   0.7581

5. Conclusion

In this paper, we propose a method that utilizes a pre-trained model to encode text information to solve the authorship verification task at PAN @ CLEF 2021. To resolve the problem of long-text encoding, the proposed method splits long texts into short texts that a pre-trained model, BERT, can encode. As can be observed in Table 3, the classification model achieved the highest c@1 and F1-score on the small dataset of the PAN authorship verification task. Accordingly, the described approach can efficiently encode the information of long text pairs.

6. Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61806075 and No. 61772177) and the Social Science Foundation of Heilongjiang Province (No. 210120002).

7. References

[1] Koppel, M., Winter, Y.: Determining if two documents are written by the same author. Journal of the Association for Information Science and Technology, 65(1), 178-187 (2014).
[2] Kestemont, M., Markov, I., Stamatatos, E., Manjavacas, E., Bevendorff, J., Potthast, M., Stein, B.: Overview of the Authorship Verification Task at PAN 2021. Working Notes of CLEF 2021 – Conference and Labs of the Evaluation Forum, CEUR-WS.org (2021).
[3] Devlin, J., Chang, M.W., Lee, K., et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 4171-4186 (2019).
[4] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 6000-6010 (2017).
[5] Potthast, M., Gollub, T., Wiegmann, M., et al.: TIRA integrated research architecture. In: Information Retrieval Evaluation in a Changing World, pp. 123-160. Springer, Cham (2019).