=Paper= {{Paper |id=Vol-2936/paper-182 |storemode=property |title=Dual Neural Network Classification Based on BERT Feature Extraction for Authorship Verification |pdfUrl=https://ceur-ws.org/Vol-2936/paper-182.pdf |volume=Vol-2936 |authors=Xiaogang Miao,Haoliang Qi,Zhijie Zhang,Guiyuan Cao,Ruilan Lin,Wenbin Lin |dblpUrl=https://dblp.org/rec/conf/clef/MiaoQZCLL21 }} ==Dual Neural Network Classification Based on BERT Feature Extraction for Authorship Verification== https://ceur-ws.org/Vol-2936/paper-182.pdf
Dual Neural Network Classification Based on BERT Feature
Extraction for Authorship Verification
Notebook for PAN at CLEF 2021

Xiaogang Miao, Haoliang Qi*, Zhijie Zhang, Guiyuan Cao, Ruilan Lin, Wenbin Lin
Foshan University, Foshan, China


                 Abstract
                 Authorship verification is the task of deciding whether two texts have been written by the same
                 author. We regard authorship verification as a classification task. A dual neural network is
                 proposed to classify the features of text extraction. Especially, BERT is exploited as the
                 encoder to extract the text features. After training and forecasting on the given pan20-
                 authorship-verification-training-small data set, the weighted average of the specified
                 evaluation indexes (including: AUC, F1, c@1, f0.5u, Brier) can reach about 0.85.

                 Keywords 1
                 Dual Neural Network, BERT, Long Text Classification

1. Introduction

   Authorship verification is an active research field in computational linguistics. By comparing the
writing style of text, the author determines whether the same author has written two texts [1,2]. It has
been widely used in the academic field. It can be used as a detection direction of academic paper fraud
and plagiarism, not only detecting word repetition rate. In 2021, the training data are the same as last
year, open-set Authorship Verification. It is difficult to see if there are new authors and themes, so the
task difficulty has been increased [3]. Essentially, given two paragraphs, it is a text pair classification
problem to determine whether the same author writes it and whether the label is "true" or "false". This
paper uses a double-input neural network model, which can extract and learn each text information step
by step and then train and predict the results in the upper neural network.

2. Datasets

    The dataset used by the Authorship verification is given by the evaluation web pan@clef. The train
(calibration) and test datasets consists of pairs of (snippets from) two different fanfics, that were
obtained drawn from fanfiction.net. Each pair was assigned a unique identifier and we distinguish
between same-author pairs and different-authors pairs [3].
    There are two datasets verified by the author. The difference lies in the number of text pairs included.
For larger datasets and smaller datasets [4], we use smaller datasets, which contain 52,601 pairs of text.
They are all written in a JSON file, including 'id', 'fandoms', 'pair' in pair.json file, then 'id', 'same' and
'authors' in truth.json file. In fact, we mainly use the text pairs in the 'pair' item to extract the required
information [3].
    In the data set, most of the text characters are between 30,000 and 20,670, a few are between 30,000
and 300,000, and most of the words are between 55,433 and 42,372. There are only NYAN, HUAE and

1
 CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
EMAIL: xiaogangmiao163@gmail.com (A. 1); qihaoliang@fosu.edu.cn (A. 2) (*corresponding author); zhangzhijie5454@gmail.com (A. 3);
caoguiyuan2020@163.com (A. 3); LRLlinruilan@163.com (A. 3); LWBlinwenbin@163.com (A. 3)
ORCID: 0000-0001-6816-0922 (A. 1); 0000-0003-1321-5820 (A. 2)
            ©️ 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
other words in the text pairs of some of the data, because the data is from fanfection.net, we can't
determine whether it is some wrong data.

3. Method

   We adopt a dual-input neural network model to learn text information and a feature extraction
processing text based on the BERT model [5]. The specific implementation method is as follows.

    3.1.           Dual Neural Network Classifier

   We used a dual-input model for the authorship verification task for classification training. The
specific model is shown below. Figure 1-(a) is the preprocessing process of the original text, and the
long text is aligned with the clause. Figure 1-(b) is to put the well-divided sentence set into the BERT
model to extract the corresponding feature vector of the text. Figure 1-(c) indicates that the extracted
feature vectors are sent to the dual-input neural network for final prediction classification.
                                                                                              classification neural
                                                      text feature
                                                                                                     network
                                                    representation
        sentence list1       sentence list2


                                              Feature extraction based on
                                                      BERT model
               sentence segmentation                                        Neural Network Information       Neural Network Information
                                                                                    Extraction                       Extraction

                                                     sentence list
               text1          text2                   (s1,s2,…..)           text1 representation               text2 representation
              (a)text pretreatment                (b)feature extraction                    (c)model classification

Figure 1: model summary


    3.2.           Text Processing

    First of all, according to statistics, each text segment is based on multiple sentences, so we use
punctuation marks as sentence separators to divide, mainly full stop, exclamation mark, and question
mark. In this way, the long text segment can be divided into several small sentences, which can be
processed by BERT. However, there are also some problems. For example, the sentences which are
composed of multiple repeated words such as 'NYAN' mentioned in the data set do not have the above
sentence segmentation conditions.
    Some text segments are more than 20,000 characters in length, but sentence separators are not used
in the whole paragraph, which may be related to the author's writing habits. Therefore, if there is no
sentence after the first sentence segmentation, we need to segment the sentence directly according to
the length of the character. In theory, some information will be lost. Fortunately, the sentences that need
to be segmented occupy a small part of the whole dataset, which is less than 0.1%, so it has little impact
on the final result.
    Through the above text segmentation operation, we will divide the long text into several short
sentences. Here we use BERT as a feature extractor. As a pre-training model, BERT has multiple
versions. We use BERT-Large, Cased (Whole Word Masking): 24-layer, 1,024-hidden, 16-heads,
340M parameters [5]. Because the parameters have been trained in advance, in the normal text
classification task, we use BERT as a feature extractor and then a downstream task (usually neural
network) as a model and then fine-tune the overall performance. But in this task, we can't input all the
text at one time, so we only consider using BERT as a feature extractor. The specific method is to
extract the CLS vector of each short sentence as the feature of the short sentence through 12 level
Transformer. CLS obtains the sentence-level information representation through the self-attention
mechanism, which can capture the context information representation in the current environment. Then
the CLS vectors of several short sentences are superimposed to represent the sentence feature vectors
of long text.

4. References

   Based on the above general introduction, we will introduce the working model and results in detail
below.

    4.1.         Second level heading

    Because of the length of the text in the task data set and the information contained in the text is very
scattered, it is difficult to use sliding window truncation. However, since the computation resources and
time consumed by BERT increase with the square level of token length, it can't handle too long tokens.
At present, it only supports 512 tokens at most. If the token is too long, it is easy to overflow memory.
Therefore, we need to design a clever method to solve this problem when using BERT to process long
text. Therefore, we need to first make some clauses in the text. First of all, we divide the data set into
two parts: the positive and negative samples of the data set are 50%, the first half is true, and the later
version is false. Then we take 52,000 as the training set and the rest as the verification. The first method
is to splice the feature vectors of two texts because each feature vector is 768 dimensions, and the
spliced vector is a feature representation of 1,536 dimensions. Then we send this feature into a simple
neural network for binary classification, but after training, we find that its accuracy in the training set
is not very high. The overall verification set is only 0.82. Then the second method is that we build a
double input neural network, which is the fusion of the information from the training of the two text
features in their respective neural networks [6]. After a two-classification prediction, the final overall
on the test set can reach 0.87. We speculate that the reason for this may be that the neural network can
not effectively recognize the whole order of sentences in the first step after using splicing, or we have
not trained it sufficiently. In fact, the number of neurons and activation function, and other parameters
are selected and experimented with according to our experience. The implementation method is not
necessarily the optimal result under this idea but only a reference scheme.

    4.2.         Second level heading

    In addition to the above data allocation, we also allocate 80% of the training data and 20% of the
validation data to compare the training before starting the formal training to find the appropriate number
of training epochs. Table 1 shows the experimental results.
Table 1
The score of training data and validation data after processing
     data set            AUC              C@1               F0.5U          F1-SCORE          OVERALL
      training          0.866            0.938              0.858             0.865            0.881
   verification         0.855            0.914              0.839             0.857            0.867

   Because the final score on the test set was not given at the time of final submission, we can only
show the score on the training set now, which may not be very accurate.

5. References

   In the authorship verification task, we use the dual-input neural network model to accept the text
features extracted from the BERT model for classification learning. In the model, the feature
information of the two texts can be learned respectively and then sent to the upper neural network for
classification. To some extent, it avoids the decrease of learning efficiency caused by the fact that the
splicing cannot learn the two-sentence sequences. Good results are achieved on the training dataset
given by the evaluation task. Note that we did not fine-tune the native model when we used BERT as a
feature extractor. Because we first segment the text and then extract features, if you want to fine-tune
should be considered into the segmented sentence for operation. However, the information obtained by
this model may be partial, incomplete information based on the whole text, and we are not sure whether
this will help on open sets. It is hoped that someone can solve the problem of feature extraction accuracy
in the future.

6. References

   [1] Kestemont M, Manjavacas E, Markov I, et al. Overview of PAN 2021: Authorship Verification,
   Profiling Hate Speech Spreaders on Twitter, and Style Change Detection[J].
   [2] Bevendorff J , Chulvi B , GLDLP Sarracén, et al. Overview of PAN 2021: Authorship
   Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection: Extended
   Abstract[M]. 2021.
   [3] Organization(2021),https:/pan.webis.de/
   [4] Rong X , Wang X , Yan L , et al. Research and application on improved BP neural network
   algorithm[C]// Industrial Electronics & Applications. IEEE, 2010.
   [5] Devlin J , Chang M W , Lee K , et al. BERT: Pre-training of Deep Bidirectional Transformers
   for Language Understanding[J]. 2018.
   [6] Boenninghoff B , Rupp J , Nickel R M , et al. Deep Bayes Factor Scoring for Authorship
   Verification[C]// Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum.
   2020.