<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team riyahsanjesh at PAN: Multi-feature with CNN and Bi-LSTM Neural Network Approach to Style Change Detection Notebook for the PAN Lab at CLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riya Sanjesh</string-name>
          <email>riya.sanjesh@presidencyuniversity.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alamelu Mangai</string-name>
          <email>alamelu.jothidurai@presidencyuniversity.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLEF 2024: Conference and Labs of the Evaluation Forum</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Presidency University</institution>
          ,
          <addr-line>Ittagallpura, Bengaluru</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>PAN 2024 conducted the Multi-Author Writing Style Analysis task, which aims to detect style changes between consecutive paragraphs in a text. The task provides datasets at three levels of difficulty to test the submissions. This paper describes our approach to this problem. It extracts multiple stylometric features from the input text and detects style changes using a trained neural network based on CNN and Bi-LSTM layers along with global max pooling. The proposed system obtained F1 scores of 0.78, 0.724 and 0.601 for the three subtasks on the provided validation dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2024</kwd>
        <kwd>Multi-Author Writing Style Analysis</kwd>
        <kwd>Stylometric Features</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Bi-LSTM</kwd>
        <kwd>Convolutional Neural Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Multiple stylometric features are extracted from the source documents and
converted into embeddings before being fed to the neural network. On the
validation data provided by this task, the proposed system achieved F1 scores
of 0.78, 0.724 and 0.601 for the three subtasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>The PAN 2024 Multi-Author Writing Style Analysis task is subdivided into three
subtasks of increasing difficulty:</p>
      <list list-type="order">
        <list-item><p>Easy: the paragraphs of a document cover a variety of topics, allowing
approaches to make use of topic information to detect authorship changes.</p></list-item>
        <list-item><p>Medium: the topical variety in a document is small (though still present),
forcing approaches to focus more on style to solve the detection task
effectively.</p></list-item>
        <list-item><p>Hard: all paragraphs in a document are on the same topic.</p></list-item>
      </list>
      <p>
        PAN provided a separate dataset for each of these subtasks. Each
dataset is further divided into three sets: training, validation
and test. The proposed system discussed in this paper was trained on the
training set and validated on the validation set. The trained
system was submitted to the TIRA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] platform, where it was evaluated
on the test set.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Works</title>
      <p>
        PAN at CLEF has conducted style change detection tasks in successive
editions since 2017. Work in this area includes Supervised
Contrastive Learning for Multi-Author Writing Style Analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in 2023, and
Ensemble-Based Clustering for Writing Style Change Detection in Multi-Authored
Textual Documents [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Style Change Detection Based On Bi-LSTM And Bert [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in 2022. The last of these proposed a neural network combining
Bi-LSTM and CNN layers with BERT embeddings as input. The system proposed in this
paper is partly based on that work but differs in the structure of the
neural network and in its input. Another related work is Style Change
Detection using Siamese Neural Networks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which proposed a
Siamese network with a GloVe embedding layer and a Bi-LSTM layer, along with other
layers.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. System Overview</title>
      <p>In the proposed system, the input text is divided into pairs of consecutive
paragraphs. The system then generates embeddings based on multiple
stylometric features extracted from the input text. These features include:</p>
      <list list-type="bullet">
        <list-item><p>TF-IDF for character n-grams</p></list-item>
        <list-item><p>TF-IDF for n-grams of POS tags</p></list-item>
        <list-item><p>TF-IDF for n-grams of POS tag chunks</p></list-item>
        <list-item><p>TF-IDF for punctuation marks used in the text</p></list-item>
        <list-item><p>Frequency of stop words</p></list-item>
        <list-item><p>Count of characters in the text</p></list-item>
        <list-item><p>Count of words in the text</p></list-item>
      </list>
      <p>Multiple features are extracted so as to better represent the
style of the author. The resulting embeddings are fed into a neural network, which is
trained on the training data to predict whether a pair of paragraphs has similar
stylometric properties.</p>
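      <p>The pairing step and a few of the listed features can be illustrated with a minimal sketch. It is not the code used in the submitted system: the function names, the tiny stop-word list, the trigram size and the smoothed IDF formula are all assumptions made for illustration, and the POS-based features are omitted because they require a tagger.</p>
      <preformat>
```python
import math
import string
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it"}


def consecutive_pairs(paragraphs):
    """Return (paragraph_i, paragraph_i+1) for every adjacent pair."""
    return list(zip(paragraphs, paragraphs[1:]))


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_char_ngrams(texts, n=3):
    """TF-IDF weights per text with an assumed smoothed IDF:
    ln((1 + N) / (1 + df)) + 1."""
    counts = [char_ngrams(t, n) for t in texts]
    # Iterating a Counter yields each n-gram once, so this is document frequency.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(texts)
    weighted = []
    for c in counts:
        size = sum(c.values()) or 1
        weighted.append({
            gram: (freq / size) * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, freq in c.items()
        })
    return weighted


def surface_features(text):
    """Stop-word frequency, punctuation count, character count and word count."""
    words = text.lower().split()
    stripped = [w.strip(string.punctuation) for w in words]
    stop_hits = sum(1 for w in stripped if w in STOP_WORDS)
    return {
        "stop_word_freq": stop_hits / len(words) if words else 0.0,
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        "char_count": len(text),
        "word_count": len(words),
    }


# Made-up example document with three paragraphs.
paragraphs = ["The cat sat on the mat.", "A dog barked!", "It is late, and dark."]
pairs = consecutive_pairs(paragraphs)
weights = tfidf_char_ngrams(paragraphs)
features = [surface_features(p) for p in paragraphs]
```
      </preformat>
      <p>A document with three paragraphs yields two pairs, and each pair would be represented by the features of its two paragraphs before being passed to the network.</p>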
      <p>The neural network combines a one-dimensional convolutional neural network
(CNN) with global max pooling and bidirectional LSTM (Bi-LSTM) layers; their
outputs are concatenated and followed by a dense layer. The final (output) layer
performs the classification. Fig. 1 shows the structure of this neural network. The
neural network was trained three times, once on the dataset for each of
the subtasks (Easy, Medium and Hard), yielding three different models, one
per subtask.</p>
      <p>A Long Short-Term Memory (LSTM) network is a special type of recurrent neural
network that is better suited to capturing long-range dependencies within a
sequence. A bidirectional LSTM (Bi-LSTM) combines two LSTM layers that read
the input in opposite directions, unlike a plain LSTM, where the input flows
in one direction only. In other words, a Bi-LSTM can use both past and future
context and thus give a more meaningful output, especially in natural
language processing. In the proposed system, the Bi-LSTM layer is configured with a dropout
of 0.2. Dropout improves generalization and reduces overfitting to the
training data.</p>
      <p>A convolutional neural network (CNN) extracts salient features from the
input, which reduces the number of features and can thereby improve
the accuracy and performance of the model. In the proposed system, the CNN
layer uses ReLU as its activation function. It is followed by global max
pooling, which reduces the dimensionality of the CNN output and hence the
number of parameters in the subsequent layers. This further helps accuracy and speed. The
outputs of the Bi-LSTM and of the global max pooling are then concatenated
and passed to a dense layer with ReLU activation. Finally, the
output layer produces the classification using the softmax function.</p>
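      <p>The layer arrangement described above can be sketched with the Keras functional API. This is a sketch under stated assumptions, not the submitted model: the input shape, the numbers of filters and units, the kernel size, the two-class output, the optimizer, the loss and the choice to feed both branches from the same input are not given in the text and are chosen only for illustration. The name build_model is hypothetical.</p>
      <preformat>
```python
from tensorflow.keras import Model, layers


def build_model(seq_len, feat_dim, n_classes=2):
    """CNN + global max pooling branch and Bi-LSTM branch, concatenated."""
    # Input: a sequence of feature vectors (the shape is an assumption).
    inputs = layers.Input(shape=(seq_len, feat_dim))

    # Convolutional branch: 1D convolution with ReLU, then global max pooling.
    # 64 filters and kernel size 3 are assumed values.
    conv = layers.Conv1D(filters=64, kernel_size=3, activation="relu")(inputs)
    pooled = layers.GlobalMaxPooling1D()(conv)

    # Recurrent branch: bidirectional LSTM with dropout 0.2, as stated in the text.
    # 64 units is an assumed value.
    recurrent = layers.Bidirectional(layers.LSTM(64, dropout=0.2))(inputs)

    # Concatenate both branches, then dense ReLU layer and softmax output.
    merged = layers.Concatenate()([recurrent, pooled])
    hidden = layers.Dense(64, activation="relu")(merged)
    outputs = layers.Dense(n_classes, activation="softmax")(hidden)
    return Model(inputs=inputs, outputs=outputs)


model = build_model(seq_len=20, feat_dim=16)
# Optimizer and loss are assumptions; the text does not specify them.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```
      </preformat>
      <p>Calling model.summary() prints the resulting layer graph, which can be compared against the structure in Fig. 1.</p>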
      <p>Fig. 1. Structure of the proposed neural network</p>
      <p>The system was trained on two different datasets, one from the PAN 2023 and the
other from the PAN 2024 style change detection task. The two datasets are
similar in structure: each is divided into three parts based on
increasing levels of difficulty (Easy, Medium and Hard), as described earlier.</p>
      <p>This training produced two sets of models, one per dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The proposed system was submitted to the PAN Multi-Author Writing Style
Analysis task. The two submissions, based on the two sets of models (one
trained on the PAN 2023 style change detection dataset and the other on
the PAN 2024 style change detection dataset), were named ‘rancid-factor’ and
‘knurled-starter’ respectively. In what follows, these systems are referred to as
System1 and System2 respectively. In each system, the three models trained earlier
made the predictions for the three subtasks. Table 1 shows the F1 scores
obtained on the validation data for the three subtasks. Table 2 shows
the results of the run on the test dataset. The scores of the two baseline
predictors are also given in Table 2. The first baseline predictor (Baseline
Predict 1) always predicts 1, i.e. a change of author between consecutive
paragraphs, and the second baseline predictor (Baseline Predict 2) always
predicts 0, i.e. no change of author between consecutive paragraphs of
a document.</p>
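      <p>How an F1 score and the two constant baselines behave can be seen in a small worked sketch. The labels below are made up, and macro-averaging over the two classes is an assumption, since this section does not state which F1 variant is reported. The function names are hypothetical.</p>
      <preformat>
```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of one class, treating that class as the positive one."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Made-up ground truth: 1 = author change, 0 = no change.
y_true = [1, 0, 1, 1]
always_change = [1] * len(y_true)  # corresponds to Baseline Predict 1
always_same = [0] * len(y_true)    # corresponds to Baseline Predict 2

score_change = macro_f1(y_true, always_change)
score_same = macro_f1(y_true, always_same)
```
      </preformat>
      <p>Under macro-averaging, a constant predictor always scores zero on the class it never predicts, so its score depends only on how often the predicted class occurs in the data.</p>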
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The two systems performed much better than the two baseline predictors
provided. System2 performed much better than System1 on Task 1, but both systems obtained
similar scores on Task 2 and Task 3. Neither system did well on Task
3, which corresponds to the ‘Hard’ subtask. More work is therefore
required for the case where topical variety is very low and the
system must rely on style rather than topic. This calls for
better feature extraction techniques.</p>
    </sec>
    <sec id="sec-7">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024)</source>
          , Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Multi-Author Writing Style Analysis Task at PAN 2024</article-title>
          , in:
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          , in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.),
          <source>Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023)</source>
          , Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          . URL: https://link.springer.com/chapter/10.1007/978-3-031-28241-6_20. doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Supervised Contrastive Learning for Multi-Author Writing Style Analysis</article-title>
          , in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.),
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          , CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
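          <string-name>
            <given-names>J.</given-names>
            <surname>Zi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Style Change Detection Based On Bi-LSTM And Bert</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,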
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Style Change Detection Based On Bi-LSTM And Bert</article-title>
          , in:
          <source>CLEF 2022 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <article-title>Style Change Detection using Siamese Neural Networks</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          (Eds.),
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>