<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Team riyahsanjesh at PAN: Multi-feature with CNN and Bi-LSTM Neural Network Approach to Style Change Detection Notebook for the PAN Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Riya</forename><surname>Sanjesh</surname></persName>
							<email>riya.sanjesh@presidencyuniversity.in</email>
							<affiliation key="aff0">
								<orgName type="institution">Presidency University</orgName>
								<address>
									<settlement>Ittagallpura, Bengaluru</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alamelu</forename><surname>Mangai</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Presidency University</orgName>
								<address>
									<settlement>Ittagallpura, Bengaluru</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Team riyahsanjesh at PAN: Multi-feature with CNN and Bi-LSTM Neural Network Approach to Style Change Detection Notebook for the PAN Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0655C45247B62D4F7BA87D540352B54D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>PAN 2024</term>
					<term>Multi-Author Writing Style Analysis</term>
					<term>Stylometric Features</term>
					<term>Deep Learning</term>
					<term>Bi-LSTM</term>
					<term>Convolutional Neural Network</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>PAN 2024 conducted the Multi-Author Writing Style Analysis task, which aims to detect style changes between consecutive paragraphs of a text. The task provides datasets at three levels of difficulty to test the submissions. This paper describes our approach to the problem. It extracts multiple stylometric features from the input text and detects style changes using a trained neural network that combines CNN and Bi-LSTM layers with global max pooling. The proposed system obtained F1 scores of 0.78, 0.724 and 0.601 for the three subtasks on the validation data set provided.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>With the advent of the internet and generative AI tools, it is easy to copy someone else's work into one's own, or to copy from multiple sources and claim the result as one's own work. This places great emphasis on intellectual property rights. Detecting such plagiarism manually is difficult. A style change detection system can improve both the accuracy and the speed of this task. These techniques can also be used to classify authors. This type of analysis, along with other textual analyses, falls under forensic text analysis. The PAN workshop series has been active in this area since 2009. One of the tasks conducted as part of PAN at CLEF 2024 <ref type="bibr" target="#b0">[1]</ref> is 'Multi-Author Writing Style Analysis' <ref type="bibr" target="#b1">[2]</ref>, which continues a series of such tasks conducted since 2018. The aim of the 2024 task is to detect stylometric changes across consecutive paragraphs of a document. The task is divided into three subtasks by difficulty level. Task 1 involves documents that cover a variety of topics. Task 2 includes documents with a small variety of topics, less than in Task 1. Task 3 consists of documents on a single topic. This paper proposes a solution for the PAN 2024 Multi-Author Writing Style Analysis task, developed after analyzing past work in this area. The proposed solution employs a neural network model that combines CNN and Bi-LSTM layers with global max pooling to detect style changes.</p><p>Multiple features are extracted from the source documents, and embeddings are generated from them before they are fed to the neural network. On the validation data provided by the task, the proposed system achieved F1 scores of 0.78, 0.724 and 0.601 for the three subtasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>The PAN 2024 Multi-Author Writing Style Analysis task is subdivided into three subtasks of increasing difficulty.</p><p>1. Easy: The paragraphs of a document cover a variety of topics, allowing approaches to make use of topic information to detect authorship changes.</p><p>2. Medium: The topical variety in a document is small (though still present), forcing the approaches to focus more on style to solve the detection task effectively.</p><p>3. Hard: All paragraphs in a document are on the same topic.</p><p>PAN provided a separate data set for each of these subtasks. Each data set is further divided into three parts: Training, Validation and Test. The system discussed in this paper was trained on the Training part and validated on the Validation part. The trained system was submitted to the TIRA <ref type="bibr" target="#b2">[3]</ref> platform, where it was evaluated on the Test part.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Works</head><p>PAN at CLEF has conducted a succession of style change detection tasks since 2017. Work in this area includes Supervised Contrastive Learning for Multi-Author Writing Style Analysis <ref type="bibr" target="#b3">[4]</ref> in 2023, Ensemble-Based Clustering for Writing Style Change Detection in Multi-Authored Textual Documents <ref type="bibr" target="#b4">[5]</ref>, and Style Change Detection Based On Bi-LSTM And Bert in 2022 <ref type="bibr" target="#b5">[6]</ref>. The last of these proposed a neural network involving Bi-LSTM and CNN layers with BERT embeddings as the input. The system proposed in this paper is to some extent based on that work, but differs in the structure of the neural network and in the input to it. Another similar work is Style Change Detection using Siamese Neural Networks <ref type="bibr" target="#b6">[7]</ref>, which proposed a Siamese network with a GloVe embedding layer and a Bi-LSTM layer, along with other layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">System Overview</head><p>In the proposed system the input text is divided into pairs of consecutive paragraphs. The system then generates embeddings based on multiple stylometric features extracted from the input text. These features include:</p><p>• TF-IDF for character n-grams</p><p>• TF-IDF for n-grams of POS tags</p><p>• TF-IDF for n-grams of POS tag chunks</p><p>• TF-IDF for punctuation marks used in the text</p><p>• Frequency of stop words</p><p>• Count of characters in the text</p><p>• Count of words in the text</p><p>Multiple features are extracted so as to better represent the style of the author. The resulting embeddings are fed into a neural network, which is trained on the training data to predict whether a pair of paragraphs has similar stylometric properties.</p><p>The neural network combines a one-dimensional convolutional neural network, followed by global max pooling, with a bidirectional LSTM layer. Their outputs are concatenated and passed to a dense layer. The final (output) layer does the classification. <ref type="figure" target="#fig_0">Fig 1</ref> shows the structure of this neural network. The neural network was trained three times, once on each data set corresponding to the subtasks (Easy, Medium and Hard), producing a separate model for each subtask. An LSTM (Long Short-Term Memory) network is a special type of recurrent neural network that is better suited to maintaining long-range connections within a sequence. A Bi-LSTM (Bidirectional LSTM) is a combination of two LSTM layers that process the input in opposite directions, unlike a plain LSTM, where the input flows in one direction only. In other words, a Bi-LSTM can analyse both past and future information and thus give a more meaningful output, especially in natural language processing. In the proposed system the Bi-LSTM layer is set with a dropout of 0.2. Adding dropout improves generalization and helps avoid overfitting the training data.</p><p>A Convolutional Neural Network (CNN) extracts important features from the input, which helps in reducing the number of features and thereby improving the accuracy and performance of the model. In the proposed system the CNN layer uses 'ReLU' as the activation function. It is followed by global max pooling, which reduces the dimensions of its input and thereby the number of parameters in the following layers. This helps in further improving accuracy and speed. The outputs of the Bi-LSTM and of the global max pooling are then concatenated and passed to a dense layer with the 'ReLU' activation function. Finally, the output layer produces the classification using the 'Softmax' function. The system was trained on two different data sets, one from the PAN 2023 and the other from the PAN 2024 style change detection task. Both data sets are similar in structure. Each is divided into three parts by increasing level of difficulty (Easy, Medium and Hard), as described earlier.</p><p>With this training, two sets of models were generated, one for each data set.</p></div>
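<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pairing step and the simpler count-based features listed in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tiny stop-word list, the whitespace tokenisation and the concatenation of the two paragraphs' feature values are all assumptions made for illustration, and the TF-IDF features are left out.</p><p>
```python
import string

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def paragraph_pairs(paragraphs):
    """Return the consecutive paragraph pairs (p1, p2), (p2, p3), ..."""
    return list(zip(paragraphs, paragraphs[1:]))


def simple_features(text):
    """Compute a few count-based stylometric features for one paragraph."""
    words = text.lower().split()  # naive whitespace tokenisation (assumption)
    stripped = [w.strip(string.punctuation) for w in words]
    stop_count = sum(1 for w in stripped if w in STOP_WORDS)
    punct_count = sum(1 for ch in text if ch in string.punctuation)
    return {
        "char_count": len(text),
        "word_count": len(words),
        "stop_word_freq": stop_count / len(words) if words else 0.0,
        "punct_count": punct_count,
    }


def pair_features(pair):
    """Concatenate the feature values of both paragraphs (assumed layout)."""
    first, second = (simple_features(p) for p in pair)
    return list(first.values()) + list(second.values())


doc = ["The cat sat.", "A dog, in the park!", "Rain is wet."]
pairs = paragraph_pairs(doc)
vectors = [pair_features(p) for p in pairs]
```
</p><p>Each resulting vector describes one pair of consecutive paragraphs. In a fuller pipeline such vectors would presumably be combined with the TF-IDF features before being passed to the network.</p></div>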
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>The proposed system was submitted to the PAN Multi-Author Writing Style Analysis task. The two submissions, based on the two sets of models (one trained on the PAN 2023 style change detection task data and the other on the PAN 2024 style change detection task data), were named 'rancid-factor' and 'knurled-starter' respectively. From here on these systems are referred to as System1 and System2 respectively. In each system, the three models trained earlier made the predictions for the three respective subtasks. Table <ref type="table" target="#tab_0">1</ref> shows the F1 scores for the three tasks on the validation data. Table <ref type="table" target="#tab_1">2</ref> shows the results of the run on the Test data set. The scores of the two baseline predictors are also given in Table <ref type="table" target="#tab_1">2</ref>. The first baseline predictor (Baseline Predict 1) always predicts 1, i.e. a change of author between consecutive paragraphs, and the second baseline predictor (Baseline Predict 0) always predicts 0, i.e. no change of author between consecutive paragraphs of a document.</p></div>
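<div xmlns="http://www.tei-c.org/ns/1.0"><p>The behaviour of the two constant baselines can be illustrated with a short Python sketch. The labels below are hypothetical, the helper names are our own, and the use of macro-averaged F1 over the two classes is an assumption, since the text above refers only to an F1 score.</p><p>
```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, computed from TP, FP and FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores (assumed metric)."""
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


# Hypothetical gold labels: 1 = author change, 0 = no change.
y_true = [1, 0, 1, 1]

always_one = [1] * len(y_true)   # baseline that always predicts a change
always_zero = [0] * len(y_true)  # baseline that never predicts a change

score_one = macro_f1(y_true, always_one)
score_zero = macro_f1(y_true, always_zero)
```
</p><p>Because a constant predictor never outputs one of the two classes, that class contributes an F1 of zero to the average, which is why such baselines score low whatever the label distribution.</p></div>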
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>The two systems performed much better than the two baseline systems provided. System2 performed much better on Task 1, but both systems obtained similar scores on Task 2 and Task 3. Neither system did well on Task 3, which corresponds to the 'Hard' subtask. This means that more work is required for documents with very little topical variety, where the system needs to be style oriented rather than topic oriented. This calls for better feature extraction techniques.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig 1.</head><label>1</label><figDesc>Fig 1. Structure of the proposed neural network</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="3,102.65,495.65,390.00,286.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>F1 score of proposed system run on the Validation data set</figDesc><table><row><cell>Task</cell><cell>Task 1</cell><cell>Task 2</cell><cell>Task 3</cell></row><row><cell>rancid-factor</cell><cell>0.78</cell><cell>0.724</cell><cell>0.601</cell></row><row><cell>knurled-starter</cell><cell>0.825</cell><cell>0.712</cell><cell>0.599</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Overview of the F1 accuracy for the multi-author writing style task in detecting at which positions the author changes for task 1, task 2, and task 3.</figDesc><table><row><cell>Approach</cell><cell>Task 1</cell><cell>Task 2</cell><cell>Task 3</cell></row><row><cell>rancid-factor</cell><cell>0.635</cell><cell>0.733</cell><cell>0.638</cell></row><row><cell>knurled-starter</cell><cell>0.825</cell><cell>0.712</cell><cell>0.599</cell></row><row><cell>Baseline Predict 1</cell><cell>0.466</cell><cell>0.343</cell><cell>0.320</cell></row><row><cell>Baseline Predict 0</cell><cell>0.112</cell><cell>0.323</cell><cell>0.346</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">B</forename><surname>Casals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dementieva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elnagar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Korenčić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Smirnova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Taulé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ustalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zangerle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of the Multi-Author Writing Style Analysis Task at PAN</title>
		<author>
			<persName><forename type="first">E</forename><surname>Zangerle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Continuous Integration for Reproducible Shared Tasks with TIRA</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kolyada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Grahm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elstner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Loebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-28241-6_20</idno>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Kamps</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Crestani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Maistro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Joho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Davis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Gurrin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><surname>Kruschwitz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Caputo</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="236" to="241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Supervised Contrastive Learning for Multi-Author Writing Style Analysis</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Han</surname></persName>
		</author>
		<ptr target=".org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2023 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Style Change Detection Based On Bi-LSTM And Bert</title>
		<author>
			<persName><forename type="first">Jiayang</forename><surname>Zia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ling</forename><surname>Zhoua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhengyao</forename><surname>Liua</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2022 Labs and Workshops</title>
		<title level="s">Notebook Papers</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Style Change Detection Based On Bi-LSTM And Bert</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2022 Labs and Workshops</title>
		<title level="s">Notebook Papers</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Style Change Detection using Siamese Neural Networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nath</surname></persName>
		</author>
		<ptr target=".org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2021 Labs and Workshops</title>
		<title level="s">Notebook Papers</title>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Maistro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Piroi</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
