<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contrastive Learning of Sample Pairs for Authorship Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mingcan Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <email>hanzhongyuan@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoyang Chen</string-name>
          <email>hoyo.chen.i@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoliang Qi</string-name>
          <email>haoliang.qi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe a contrastive learning method that computes its loss over sample pairs to tackle the authorship verification task. Classical sample-based contrastive learning is not applicable to this task because it needs to compare multiple samples in the same batch. Our method widens the gap between the cosine similarities of positive sample pairs and negative sample pairs, so that the model can judge whether a sample pair is more similar or less similar. Evaluation results on the PAN dataset show that the method is effective: it can decide whether the two texts were written by the same author for more than 50% of the sample pairs, with an overall score greater than 0.6.</p>
      </abstract>
      <kwd-group>
        <kwd>contrastive learning</kwd>
        <kwd>sample pairs</kwd>
        <kwd>authorship verification</kwd>
        <kwd>cosine similarity</kwd>
      </kwd-group>
      <custom-meta-group>
        <custom-meta>
          <meta-name>ORCID</meta-name>
          <meta-value>0000-0002-4977-2138 (M. Guo); 0000-0001-8960-9872 (Z. Han); 0000-0003-3223-9086 (H. Chen); 0000-0003-1321-5820 (H. Qi)</meta-value>
        </custom-meta>
      </custom-meta-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Text classification is a basic research direction in NLP. Within this direction, the purpose of Authorship Verification is to judge whether two texts are written by the same person. Authorship Verification can be widely used in article duplication verification, article source finding, plagiarism detection, and other fields.</p>
      <p>In the dataset of the Authorship Verification task of PAN@CLEF 2023 [<xref ref-type="bibr" rid="ref1">1</xref>, <xref ref-type="bibr" rid="ref2">2</xref>], similar to last year, the organizers provide four types of text data: interview, email, essay, and speech transcription. For this task, we build a sentence vector model based on the simple idea of training on labeled sample pairs: the labeled data are ordinary text pairs, each in the format "(text1, text2, label)", and we complete the task with a contrastive learning method that uses an improved loss function. At the same time, to address the small number of training samples, we split and recombine the texts to obtain a large amount of training data, and we train our model on this large number of sample pairs to improve its inference ability on the test set. Finally, we submit our run on TIRA.io [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>The dataset provided by the organizers of the Authorship Verification task contains a total of 8836 labeled text pairs from 56 authors. The label is 1 or 0 and indicates whether the two texts of a pair are from the same author.</p>
      <p>This year, there are four discourse types: interview, email, essay, and speech transcription.
The length of each text is between 1 and 3499, and the number and average length of the texts of each
discourse type are shown in Table 1.</p>
      <p>2023 Copyright for this paper by its authors.</p>
      <p>In PAN@CLEF 2023, the organizers focus on cross-discourse-type authorship verification,
where both written language (i.e., essays and emails) and spoken language (i.e., interviews and speech
transcriptions) are represented in the set of discourse types.</p>
    </sec>
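    <sec id="sec-3">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset Preprocessing</title>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Quantity and average length of the texts of the four discourse types.</p>
        </caption>
        <table>
          <tbody>
            <tr>
              <th>Quantity</th>
              <td>275</td>
              <td>450</td>
              <td>93</td>
              <td>68</td>
            </tr>
            <tr>
              <th>Average length</th>
              <td>478</td>
              <td>352</td>
              <td>388</td>
              <td>409</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-4-2">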
        <p>The training set contains a total of 56 authors. During dataset processing, a total of 886 unique
texts are obtained after deduplication. A text list is built following the order of these authors, and each
text is matched either with a text from the same author to form a positive sample or with a text from a
different author to form a negative sample.</p>
        <p>Specifically, suppose the extracted text list is list_all = [texta1, texta2, textb1, ..., textz16], where a, b, etc. denote different authors and 1, 2, etc. denote different texts by the same author. We recombine these texts so that texts of the same author are matched with each other (the first with the second, the second with the third, and so on), which generates a total of <inline-formula><tex-math>\sum_{i=1}^{m} \binom{n_i}{2}</tex-math></inline-formula> positive sample pairs, where m is the total number of authors and n<sub>i</sub> is the number of texts of the i-th author, i ∈ {1, 2, 3, …, m}. We then match the first author's first document with a random text of the second author, the first author's second document with a random text of the third author, and so on, which generates a total of <inline-formula><tex-math>\sum_{i=1}^{m} n_i (m - 1)</tex-math></inline-formula> negative sample pairs. In this way we finally obtain 55,000 new samples, such as (texta1, texta2, 1), (texta1, textb1, 0), etc.</p>
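        <p>The recombination strategy can be illustrated with a minimal Python sketch. It is not the authors' code: the function name build_pairs, the toy corpus, and the seed are hypothetical, and the sketch assumes that every unordered pair of texts by one author forms a positive sample and that every text meets one random text of each other author to form negative samples, which is one reading consistent with the pair counts given above.</p>
        <preformat>
```python
import itertools
import random


def build_pairs(texts_by_author, seed=0):
    """Return (positives, negatives) as lists of (text1, text2, label)."""
    rng = random.Random(seed)
    authors = list(texts_by_author)
    positives, negatives = [], []
    for author in authors:
        own_texts = texts_by_author[author]
        # Positive samples: every unordered pair of texts by the same author.
        for text1, text2 in itertools.combinations(own_texts, 2):
            positives.append((text1, text2, 1))
        # Negative samples: each text meets one random text of every other author.
        for text1 in own_texts:
            for other in authors:
                if other != author:
                    text2 = rng.choice(texts_by_author[other])
                    negatives.append((text1, text2, 0))
    return positives, negatives


# Hypothetical toy corpus: three authors with 3, 2 and 1 texts.
toy = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1"]}
pos, neg = build_pairs(toy)
print(len(pos), len(neg))  # 4 12
```
        </preformat>
        <p>With 886 texts from 56 authors, the negative count under this reading is 886 × 55 = 48,730, so the negatives dominate the roughly 55,000 samples reported above.</p>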
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2. Network Architecture</title>
      <p>
        Traditionally, most sentence vectors are formed by summing word vectors, which are usually
trained by methods such as word2vec. Such a method is simple and crude, and direct summation does
not exploit the interaction information between words. Various BERT-based models have been
proposed instead. By stacking Transformer encoders, the BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] series of pre-trained models can capture the deep bidirectional word-to-word
information in a sentence and use the token vectors of the output layer to represent the semantic
information of the entire sentence, as in BERT-flow [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and BERT-whitening [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our work adopts a text-based contrastive learning method. The purpose of contrastive learning
is to obtain a better representation vector of a text by shortening the intra-class distance and increasing
the inter-class distance. SimCSE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] demonstrated the effectiveness of contrastive learning in the text paraphrase classification task.
However, when calculating the loss, SimCSE constructs positive sample pairs only through minimal
data augmentation and simply uses the other samples in a batch as negative samples, so it has to
compare multiple samples in the same batch rather than labeled sample pairs. At the same time, a larger
batch size leads to a decrease in SimCSE performance. To address this, we use a new scheme that
optimizes a loss function based on cosine similarity.
      </p>
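      <p>Let Ω<sub>pos</sub> be the set of all positive sample pairs and Ω<sub>neg</sub> the set of all negative sample pairs.
For any positive sample pair (i, j) ∈ Ω<sub>pos</sub> and negative sample pair (k, l) ∈ Ω<sub>neg</sub>, we expect
cos(u<sub>i</sub>, u<sub>j</sub>) &gt; cos(u<sub>k</sub>, u<sub>l</sub>), where u<sub>i</sub>, u<sub>j</sub>, u<sub>k</sub>, u<sub>l</sub> are their respective sentence vectors. The new
loss is given in formulas (1) and (2), where λ is the hyperparameter of the loss function.</p>
      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>f(u_i, u_j, u_k, u_l) = e^{\lambda \left( \cos(u_k, u_l) - \cos(u_i, u_j) \right)}</tex-math>
      </disp-formula>
      <disp-formula id="eq-2">
        <label>(2)</label>
        <tex-math>\mathcal{L} = \log \Bigl( 1 + \sum_{(i,j) \in \Omega_{pos},\, (k,l) \in \Omega_{neg}} f(u_i, u_j, u_k, u_l) \Bigr)</tex-math>
      </disp-formula>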
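      <p>As a rough illustration of formulas (1) and (2), the following plain-Python sketch computes the loss for a small batch of labeled vector pairs. It assumes the exponent in formula (1) is λ times the negative-pair cosine minus the positive-pair cosine, so that the loss shrinks when every positive pair is more similar than every negative pair. The function names and toy vectors are hypothetical, and a real implementation would operate on tensors in a deep learning framework.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_contrastive_loss(pairs, lam=20.0):
    """pairs: list of (u, v, label) with label 1 (same author) or 0."""
    pos = [cosine(u, v) for u, v, label in pairs if label == 1]
    neg = [cosine(u, v) for u, v, label in pairs if label == 0]
    # Formula (1): one exponential term per (positive pair, negative pair).
    total = sum(math.exp(lam * (c_neg - c_pos)) for c_pos in pos for c_neg in neg)
    # Formula (2): log(1 + sum of all terms).
    return math.log1p(total)


# Toy batch where the positive pair is more similar than the negative pair.
well_ordered = [([1.0, 0.0], [1.0, 0.0], 1), ([1.0, 0.0], [0.0, 1.0], 0)]
# Toy batch with the labels swapped, so the ordering is violated.
badly_ordered = [([1.0, 0.0], [0.0, 1.0], 1), ([1.0, 0.0], [1.0, 0.0], 0)]

good_loss = pair_contrastive_loss(well_ordered)
bad_loss = pair_contrastive_loss(badly_ordered)
print(good_loss, bad_loss)
```
      </preformat>
      <p>The well-ordered batch yields a loss close to zero, while the violated ordering is penalized roughly in proportion to λ times the similarity gap.</p>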
      <p>Our structure is shown in Figure 1. We use BERT-Large as our pre-trained model. The
pre-processed sample pairs {ta1,...,tb1,...,tc1} and {ta2,...,tc2,...,td2} are each fed to BERT-Large for
encoding to obtain the vector representations of the texts, and the hidden states of the first and last
layers are then average pooled to obtain the sentence features {fa1,...,fb1,...,fc1} and {fa2,...,fc2,...,fd2}. The
features at the same position constitute a sample pair. Finally, the cosine similarity of each positive
sample pair is compared with the cosine similarity of each negative sample pair. By widening the gap
between positive and negative sample pairs, the positive sample pairs move closer to "more similar"
and farther from "less similar", while the negative sample pairs move closer to "less similar" and
farther from "more similar".</p>
    </sec>
    <sec id="sec-6">
      <title>4. Experiments and Results</title>
    </sec>
    <sec id="sec-7">
      <title>4.1. Experimental Setting</title>
      <p>For the dataset division, we preprocessed the training data and split it into a training set and a test
set at a ratio of 7:3.</p>
      <p>In this work, we choose BERT-Large, which has 1,024 hidden units, 24 layers and 340 million
parameters. We set the batch size to 30, the encoder maximum length to 512, the learning rate to 2e-5,
and the random seed to 34. We set the temperature coefficient λ of formula (1) to 20. We use
the AdamW optimizer to update the model weights during training. Finally, we trained for 20 epochs
on an A800 GPU.</p>
      <p>
        The CLS embedding of BERT-Large's output is not used to represent the input. Instead, the hidden
states of the first and last layers are average pooled into a new 1024-dimensional vector, where all
token embeddings except CLS and SEP are average pooled [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We believe that, with BERT-Large as the encoder, this method obtains more comprehensive
sentence features than the CLS embedding.
      </p>
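      <p>The pooling step might look like the following plain-Python sketch. It assumes that the token vectors of the first and last hidden layers are already available as nested lists, that CLS is the first token and SEP the last, and that the two layer means are averaged with equal weight. The function names are hypothetical, padding is ignored, and the toy vectors are 2-dimensional instead of 1024-dimensional.</p>
      <preformat>
```python
def mean_pool(token_vectors):
    """Average token vectors, skipping the first ([CLS]) and last ([SEP])."""
    inner = token_vectors[1:-1]  # assumption: no padding tokens are present
    dim = len(inner[0])
    return [sum(vec[d] for vec in inner) / len(inner) for d in range(dim)]


def first_last_avg(first_layer, last_layer):
    """Average the pooled first-layer and last-layer representations."""
    first_mean = mean_pool(first_layer)
    last_mean = mean_pool(last_layer)
    return [(a + b) / 2 for a, b in zip(first_mean, last_mean)]


# Toy hidden states for one text: [CLS], two word tokens, [SEP].
first_layer = [[9, 9], [1, 2], [3, 4], [9, 9]]
last_layer = [[0, 0], [5, 6], [7, 8], [0, 0]]
sentence_vector = first_last_avg(first_layer, last_layer)
print(sentence_vector)  # [4.0, 5.0]
```
      </preformat>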
      <p>During the prediction phase, we freeze the weights of the model to output the final result for the
dataset.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2. Results</title>
      <p>
        We use the organizers' two baselines for comparison. baseline-compressor23 is an authorship
verification baseline based on text compression: it uses the prediction by partial matching (PPM)
compression model of text1 to calculate the cross-entropy of text2, and vice versa. The mean and
absolute difference of the two cross-entropies are used to estimate a score in [0, 1] that represents the
probability that the two texts were written by the same author. baseline-cngdist23 uses a simple
TF-IDF-weighted bag-of-character-n-grams representation; after the cosine similarity is computed, it
is rescaled and projected so that it can act as a probability.
      </p>
      <p>
        In addition, to better evaluate our work, we obtained the system of najafi22, the best performer
among all submissions last year, and ran it on our test set with the parameters mentioned in the paper [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>To evaluate the performance of our proposed model, we used the evaluation platform provided by
PAN, which includes the following metrics:</p>
      <list list-type="bullet">
        <list-item>
          <p>AUC: the conventional area under the curve score.</p>
        </list-item>
        <list-item>
          <p>c@1: rewards systems that leave complicated problems unanswered [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        </list-item>
        <list-item>
          <p>f_05_u: focuses on deciding same-author cases correctly [<xref ref-type="bibr" rid="ref11">11</xref>].</p>
        </list-item>
        <list-item>
          <p>F1: the harmonic mean of the precision and recall of the model [<xref ref-type="bibr" rid="ref12">12</xref>].</p>
        </list-item>
        <list-item>
          <p>Brier: the Brier score evaluates the accuracy of probabilistic predictions [<xref ref-type="bibr" rid="ref13">13</xref>].</p>
        </list-item>
      </list>
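      <p>Two of these metrics are simple enough to sketch by hand. The snippet below is not the official PAN evaluator. It assumes that a score of exactly 0.5 marks an unanswered problem for c@1, and that the Brier value is reported as one minus the mean squared error so that higher is better. The function names and toy scores are hypothetical.</p>
      <preformat>
```python
def c_at_1(true_labels, scores, threshold=0.5):
    """c@1: accuracy that gives partial credit for unanswered problems."""
    n = len(true_labels)
    # Assumption: a score equal to the threshold means "left unanswered".
    unanswered = sum(1 for s in scores if s == threshold)
    correct = sum(
        1
        for y, s in zip(true_labels, scores)
        if s != threshold and (s > threshold) == (y == 1)
    )
    return (correct + unanswered * correct / n) / n


def brier_complement(true_labels, scores):
    """One minus the mean squared error between scores and labels."""
    mse = sum((s - y) ** 2 for y, s in zip(true_labels, scores)) / len(scores)
    return 1.0 - mse


# Toy predictions: two correct, one unanswered (0.5), one wrong.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.5, 0.6]
print(c_at_1(labels, scores))  # 0.625
print(round(brier_complement(labels, scores), 3))  # 0.835
```
      </preformat>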
      <p>We feed the split training and test data into our model for training and testing, and then evaluate
the results with the evaluation program. As shown in Table 2, our method performs best on AUC,
f_05_u, Brier, and overall.</p>
      <p>Ultimately, we submitted two runs, named irregular-strategist and uniform-reward.
irregular-strategist uses the model weights of the 11th training epoch, which had the best overall
performance, and uniform-reward uses those of the 20th (last) epoch. Their performance on the PAN
2023 authorship verification test set is shown in Table 3. Our best run exceeded both baselines and
reached an overall score of 0.614. Because uniform-reward used the last epoch and overfitted,
it only surpassed najafi22, with an overall score of 0.572.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Conclusion</title>
      <p>This paper presents our work on the PAN 2023 authorship verification task. We use a
sample-pair contrastive learning method based on the BERT-Large model with an improved loss
function to judge whether two texts are written by the same author. The effectiveness of our method is
verified by comparing it with other methods and models, such as the baselines, on the divided dataset.
Our method still has room for improvement. In follow-up work, we plan to incorporate more effective
methods to improve the performance of the system, such as adding features that capture the author's
writing style and compressing long texts.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Acknowledgements</title>
      <p>This work is supported by the National Social Science Foundation of China (No. 22BTQ101).</p>
    </sec>
    <sec id="sec-11">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kredens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pezik</surname>
          </string-name>
          ,
          <article-title>Overview of the Authorship Verification Task at PAN 2023</article-title>
          ,
          <source>in: CLEF 2023 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Borrego-Obrador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinea-Ríos</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2023: Authorship Verification, Multi-Author Writing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023)</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          ,
          <source>in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023)</source>
          , Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>Proceedings of NAACL-HLT</source>
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>On the sentence embeddings from pre-trained language models</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9119</fpage>
          -
          <lpage>9130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <article-title>Whitening sentence representations for better semantics and faster retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2103.15316</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>SimCSE: Simple contrastive learning of sentence embeddings</article-title>
          ,
          <source>in: 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021</source>
          , Association for Computational Linguistics (ACL),
          <year>2021</year>
          , pp.
          <fpage>6894</fpage>
          -
          <lpage>6910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Authorship verification based on fully interacted text segments</article-title>
          ,
          <source>CLEF</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Najafi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tavan</surname>
          </string-name>
          ,
          <article-title>Text-to-text transformer in authorship verification via stylistic and semantical analysis</article-title>
          ,
          <source>in: Proceedings of the CLEF</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <article-title>A simple measure to assess non-response</article-title>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Generalizing unmasking for short texts</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>654</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Brier</surname>
          </string-name>
          ,
          <article-title>Verification of forecasts expressed in terms of probability</article-title>
          ,
          <source>Monthly Weather Review</source>
          <volume>78</volume>
          (
          <year>1950</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>