<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Authorship Verification Based on CoSENT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>ZhaoHao Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <email>kongleilei@fosu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingjie Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Authorship verification is the task of determining whether two texts were written by the same author based on stylistic features. In this paper, we propose two approaches to the authorship verification task. The first employs positive-example-based data optimization to reorganize the training dataset. The second uses CoSENT, a feature-based text classification method, to accomplish the verification itself. Together, they give the model a better ability to identify whether sentence pairs are similar or dissimilar.</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship Verification</kwd>
        <kwd>Data optimization</kwd>
        <kwd>Feature-based</kwd>
        <kwd>CoSENT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Datasets</title>
      <p>The training dataset provided by the PAN@CLEF 2023 organizers [4] consists of cross-discourse
authorship verification cases drawn from the following discourse types (DTs): essays, emails,
interviews, and speech transcriptions. Among the four DTs, essays and emails belong to written
discourse, while interviews and speech transcriptions belong to spoken discourse. The dataset
includes texts from approximately 56 authors, all native English speakers of similar age (18 to 22).
The topic of the text samples is not limited or constrained. The dataset provided by PAN@CLEF 2023
contains 8,836 text pairs in total. Each problem is composed of two texts belonging to two different
DTs, and all text pairs together draw on 886 distinct texts. The distribution of the different discourse
types is shown in Table 1.</p>
      <p>Since texts of the email and interview DTs can be very short, each text belonging to these DTs
is actually a concatenation of different messages. The &lt;new&gt; tag denotes the boundaries of the
original texts, and new lines within a text are denoted with the &lt;nl&gt; tag. In addition,
author-specific and topic-specific information, e.g., named entities, has been replaced with
corresponding tags. In the spoken discourse types, additional tags are used to indicate nonverbal
vocalizations (e.g., cough, laugh).</p>
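      <p>As an illustration of these tag conventions, the following minimal sketch (our own assumption of a reasonable preprocessing step, not necessarily the organizers' pipeline) splits a concatenated text back into its original messages and restores line breaks:</p>
      <preformat>
import re

def split_messages(text):
    """Split a concatenated email/interview text on the &lt;new&gt; boundary
    tag and turn each &lt;nl&gt; tag back into a real line break."""
    messages = text.split("&lt;new&gt;")
    return [re.sub(r"&lt;nl&gt;", "\n", m).strip() for m in messages if m.strip()]

# Hypothetical example in the dataset's tagged format:
raw = "Hi &lt;addressee&gt;, see you soon.&lt;nl&gt;Best&lt;new&gt;Thanks for the notes!"
print(split_messages(raw))
      </preformat>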
    </sec>
    <sec id="sec-3">
      <title>Method</title>
    </sec>
    <sec id="sec-4">
      <title>Data preprocessing</title>
      <p>For deep learning methods, the more text data is available, the more textual features the language
model can learn. Of the 8,836 text pairs in the original training dataset, there are only 886 distinct
texts. We shuffle and recombine these distinct texts into a more extensive dataset for model training.
Every two texts belonging to the same author are combined into a positive pair, giving 6,945 positive
pairs in total. To keep the ratio of positive and negative samples at 1:1, each of the 886 distinct texts
in the original training dataset is randomly combined with texts by other authors to form 8 negative
pairs, giving 7,088 negative pairs in total. The resulting 14,033 positive and negative sentence pairs
are shuffled and divided into a 75% training set and a 25% test set, as shown in Table 2.</p>
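      <p>The following sketch shows this reorganization concretely (assuming the corpus is available as a mapping from author id to that author's texts; the function and variable names are ours):</p>
      <preformat>
import random
from itertools import combinations

def build_pairs(texts_by_author, negatives_per_text=8, seed=0):
    """Reorganize the corpus: all same-author two-text combinations become
    positive pairs (label 1); each text is paired with random texts by
    other authors to form the negative pairs (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for author, texts in texts_by_author.items():
        # Positive pairs: every two texts by the same author.
        for t1, t2 in combinations(texts, 2):
            pairs.append((t1, t2, 1))
        # Negative pairs: this author's texts vs. other authors' texts.
        others = [t for a, ts in texts_by_author.items() if a != author for t in ts]
        for t in texts:
            for other in rng.sample(others, negatives_per_text):
                pairs.append((t, other, 0))
    rng.shuffle(pairs)
    split = int(0.75 * len(pairs))
    return pairs[:split], pairs[split:]  # 75% train / 25% test
      </preformat>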
    </sec>
    <sec id="sec-5">
      <title>Model settings</title>
      <p>In this paper, we choose a feature-based text-matching approach for this task. The BERT
model is fine-tuned during the training phase on the expanded training dataset. In the prediction
stage, we send text1 and text2 of a text pair separately to the pre-trained language model BERT,
which encodes each text; the cosine similarity between the two encodings is then calculated. Finally,
whether the cosine similarity is greater than 0.5 is used as the classification criterion for the sentence
pair. The overall structure of the model is shown in Figure 1.</p>
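      <p>A minimal sketch of this prediction step, using the Hugging Face transformers API (the mean-pooling strategy and the bert-base-uncased checkpoint are our assumptions; in practice the fine-tuned weights would be loaded):</p>
      <preformat>
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Encode one text into a sentence vector (mean pooling over tokens)."""
    batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (1, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def same_author(text1, text2, threshold=0.5):
    """Classify a text pair by thresholding the cosine similarity at 0.5."""
    sim = torch.cosine_similarity(embed(text1), embed(text2)).item()
    return sim &gt; threshold, sim
      </preformat>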
      <p>We hope to improve the classification ability of the model. Su [2] proposed a new loss
function that improves the model's ability to identify similar texts and distinguish dissimilar texts.
We write Ω_pos for the set of all positive sample pairs and Ω_neg for the set of all negative
sample pairs.</p>
      <p>In fact, we hope that for any positive sample pair (i, j) ∈ Ω_pos and negative sample
pair (k, l) ∈ Ω_neg, we have</p>
      <p>cos(u_i, u_j) &gt; cos(u_k, u_l) (1)</p>
      <p>where u_i, u_j, u_k, u_l are the respective sentence vectors. We only require that the similarity
of positive sample pairs be greater than that of negative sample pairs, and it is up to the model to
decide how much larger the margin is. The loss function is shown in formula (2).</p>
      <p>loss = log(1 + Σ_{(i,j) ∈ Ω_pos, (k,l) ∈ Ω_neg} e^{λ(cos(u_k, u_l) − cos(u_i, u_j))}) (2)</p>
      <p>λ is a hyperparameter, and we set it to 20 in this experiment.</p>
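      <p>A PyTorch sketch of formula (2), following the pairwise-masking formulation of Su's published implementation [2] (the function and variable names are ours):</p>
      <preformat>
import torch

def cosent_loss(cos_sim, labels, lam=20.0):
    """CoSENT loss over a batch of text pairs.
    cos_sim: (B,) cosine similarities of the B pairs.
    labels:  (B,) 1 for same-author pairs, 0 for different-author pairs."""
    scores = lam * cos_sim
    # diff[i, j] = lambda * (cos_sim[i] - cos_sim[j]) for all combinations.
    diff = scores[:, None] - scores[None, :]
    # Keep only combinations where i is a negative pair and j a positive pair.
    valid = labels[:, None] &lt; labels[None, :]
    diff = diff - (~valid) * 1e12  # mask out everything else
    # log(1 + sum(exp(diff))) = logsumexp over the masked diffs plus a 0 term.
    diff = torch.cat([torch.zeros(1, device=diff.device), diff.flatten()])
    return torch.logsumexp(diff, dim=0)
      </preformat>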
    </sec>
    <sec id="sec-6">
      <title>Experiment setting</title>
      <p>In this work, BERT-Base (L=12, H=768, A=12, total parameters=110M) is chosen as the
pre-trained model, and we use PyTorch to construct the BERT and fully connected network
classification model. Our hyperparameters are set as follows: the batch size is 16, the maximum
sequence length is 512, the initial learning rate is set to 1e-5, and 20 epochs are trained. Training is
optimized with AdamW, and the warm-up rate is set to 0.1.</p>
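      <p>These settings correspond roughly to the following optimizer setup (a sketch; the linear-warmup schedule from transformers is our assumption, and the placeholder model stands in for the BERT classifier):</p>
      <preformat>
import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS, BATCH_SIZE, LR, WARMUP_RATE = 20, 16, 1e-5, 0.1

model = torch.nn.Linear(768, 2)          # placeholder for BERT + classifier
steps_per_epoch = (14033 * 3 // 4) // BATCH_SIZE  # 75% of the 14,033 pairs
total_steps = EPOCHS * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(WARMUP_RATE * total_steps),
    num_training_steps=total_steps,
)
      </preformat>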
    </sec>
    <sec id="sec-7">
      <title>Evaluation and results</title>
    </sec>
    <sec id="sec-8">
      <title>Evaluation</title>
      <p>To evaluate the proposed models, we use the TIRA [5] evaluation tool with the following
metrics:</p>
      <p>AUC: the area under the ROC curve.
F1-score: the harmonic mean of precision and recall [6].
c@1: rewards systems that leave complicated problems unanswered [7].</p>
      <p>F_0.5u: a measure that puts more emphasis on deciding same-author cases correctly [8].</p>
      <p>Brier: the complement of the Brier score, for evaluating the goodness of (binary) probabilistic
classifiers [9].</p>
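      <p>AUC and F1 are available directly from scikit-learn [6], and c@1 and the Brier complement are short to implement from their definitions [7, 9]. A minimal sketch on hypothetical scores (the official TIRA evaluator remains the authoritative implementation):</p>
      <preformat>
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def c_at_1(y_true, probs):
    """c@1 [7]; by PAN convention a score of exactly 0.5 means unanswered."""
    n = len(y_true)
    answered = probs != 0.5
    correct = ((probs &gt; 0.5) == y_true) &amp; answered
    nc, nu = correct.sum(), (~answered).sum()
    return (nc + nu * nc / n) / n

def brier_complement(y_true, probs):
    """1 minus the Brier score [9]; higher is better."""
    return 1.0 - np.mean((probs - y_true) ** 2)

y_true = np.array([1, 0, 1, 1])         # hypothetical gold labels
probs = np.array([0.9, 0.2, 0.5, 0.7])  # hypothetical model scores
print(roc_auc_score(y_true, probs), f1_score(y_true, (probs &gt; 0.5).astype(int)))
print(c_at_1(y_true, probs), brier_complement(y_true, probs))
      </preformat>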
    </sec>
    <sec id="sec-9">
      <title>Results</title>
      <p>We test the performance of our model on the newly reorganized test set and on the PAN 2023
authorship verification test dataset. The test results are shown in Table 3.</p>
      <p>Brier: 0.935 (Table 3).</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusion</title>
      <p>In this paper, we present our approach to authorship verification at PAN 2023. We re-pair the
texts in the training dataset into a new, larger training dataset. By augmenting the dataset in this way,
we hope the model can learn more about the similarities between texts by the same author and the
differences between texts by different authors. In addition, we introduce a special loss function to
make the model better learn the similarity and difference information between sentence pairs. We
then use the pre-trained language model BERT to extract text features and calculate the cosine
similarity between sentence vectors to judge whether two texts are from the same author.</p>
      <p>From the results, the method does not perform well on open datasets with unknown authors. In
follow-up work, we should use more effective data augmentation methods and improve the
classification ability of the model in the open-set setting.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Social Science Foundation of China (No. 22BTQ101).</p>
    </sec>
    <sec id="sec-12">
      <title>References</title>
      <p>[1] E. Stamatatos, K. Kredens, P. Pęzik, A. Heini, J. Bevendorff, M. Potthast, B. Stein, Overview of
the Authorship Verification Task at PAN 2023, in: CLEF 2023 Labs and Workshops, Notebook
Papers, CEUR-WS.org, 2023.</p>
      <p>[2] J. Su, CoSENT (I): A more efficient sentence vector scheme than Sentence-BERT, 2022. URL:
https://spaces.ac.cn/archives/8847.</p>
      <p>[3] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of
Machine Learning Research 12 (2011) 2825–2830.</p>
      <p>[4] J. Bevendorff, I. Borrego-Obrador, M. Chinea-Ríos, M. Franco-Salvador, M. Fröbe, A. Heini,
K. Kredens, M. Mayerl, P. Pęzik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein,
M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2023: Authorship Verification,
Multi-Author Writing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection,
in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi,
M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality,
and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association
(CLEF 2023), Springer, Thessaloniki, Greece, 2023.</p>
      <p>[5] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein,
M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps,
L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo
(Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023),
Springer, Berlin Heidelberg New York, 2023, pp. 236–241.</p>
      <p>[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of
Machine Learning Research 12 (2011) 2825–2830.</p>
      <p>[7] A. Peñas, A. Rodrigo, A simple measure to assess non-response, 2011.</p>
      <p>[8] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
2019, pp. 654–659.</p>
      <p>[9] G. W. Brier, Verification of forecasts expressed in terms of probability, Monthly Weather
Review 78 (1950) 1–3.</p>
    </sec>
  </body>
  <back />
</article>