<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Graph-based and Lexical-Syntactic Approaches for the Authorship Attribution Task Notebook for PAN at CLEF 2012</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Esteban</forename><surname>Castillo</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Benemérita Universidad Autónoma de Puebla Faculty of Computer Science</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Darnes</forename><surname>Vilariño</surname></persName>
							<email>darnes@cs.buap.mx</email>
							<affiliation key="aff0">
								<orgName type="institution">Benemérita Universidad Autónoma de Puebla Faculty of Computer Science</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">David</forename><surname>Pinto</surname></persName>
							<email>dpinto@cs.buap.mx</email>
							<affiliation key="aff0">
								<orgName type="institution">Benemérita Universidad Autónoma de Puebla Faculty of Computer Science</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Iván</forename><surname>Olmos</surname></persName>
							<email>iolmos@cs.buap.mx</email>
							<affiliation key="aff0">
								<orgName type="institution">Benemérita Universidad Autónoma de Puebla Faculty of Computer Science</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jesús</forename><forename type="middle">A</forename><surname>González</surname></persName>
							<email>jagonzalez@inaoep.mx</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Institute of Astrophysics, Optics</orgName>
								<orgName type="department" key="dep2">Electronics Computer Science Department</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maya</forename><surname>Carrillo</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Benemérita Universidad Autónoma de Puebla Faculty of Computer Science</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Graph-based and Lexical-Syntactic Approaches for the Authorship Attribution Task Notebook for PAN at CLEF 2012</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0FEE3BE4D2FE8659A4709ACB00CD8288</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T05:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Authorship attribution</term>
					<term>graph-based representation</term>
					<term>phrase-level lexicalsyntactic features</term>
					<term>support vector machines</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we present two different approaches for tackling the authorship attribution task. The first approach uses a set of phrase-level lexicalsyntactic features, whereas the second approach considers a graph-based text representation together with a data mining technique for discovering authorship patterns which may be further used for attributing the authorship of an anonymous document. In both cases we employed a support vector machine classifier in order to determine the target class. The features extracted by means of the graph-based approach allowed it to obtain a better performance than the other approach.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Discovering the correct features in a raw text which allows unambiguously to attribute the authorship of a given anonymous document is a very hard task. In recent years, there have been a number of research papers in this direction. The traditional authorship attribution task consists of determining the correct authorship of an anonymous document, using a supervised collection of documents, i.e., a reference set of documents manually tagged with their corresponding authorship attribution. In other words, this task can be seen as a classification problem in which the target tag or class is the author name/ID.</p><p>Determining the authorship of an anonymous document is a task that has been tackled for several years by the computational linguistic community. An effort that has been empowered by the continuous growing of information in Internet. In this sense, the importance of finding the correct features for characterizing the signature or particular writing style of a given author is fundamental for solving the problem of authorship attribution.</p><p>The results reported in this paper were obtained in the framework of the 6th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN'12). In particular, in the task named "Traditional Authorship Attribution" which has the following two sub-tasks:</p><p>-Traditional (closed class / open class, with a variable number of candidate authors).</p><p>This subtask consists of assigning the real author name to a given anonymous document, using a set of candidate authors as reference. -Clustering. A target number of paragraphs are required to be clustered into groups (between one or four) in order to obtain clusters of paragraphs that correspond to the same author.</p><p>For this purpose, we attempted two different techniques for representing the features that will be taken into account in the process of authorship attribution. The proposed approaches are discussed in the following sections.</p><p>The rest of this paper is structured as follows. In Section 2 it is presented the description of the features used in the task to be tackled. Section 3 shows the classification methods (supervised and unsupervised) employed in the experiments. The experimental setting and a discussion of the obtained results are given in Section 4. Finally, the conclusions of this research work is presented in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Description of the features used in the task</head><p>In this work we explore two very different text representation schemas. The first approach considers lexical-syntactic features, whereas the second uses a data mining based process for extracting the most relevant terms of the target documents. Both schemas are described as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Lexical-syntactic feature approach</head><p>In this approach are considered the following lexical-syntactic features for representing the particular writing style of a given author:</p><p>-Phrase level features • Vowel combination. Word consonants are removed and, thereafter, the remaining vowels are combined. Each vowel combination is considered to be a feature. Adjacent repetition of vowels are considered as only one vowel. • Vowel permutation. Word consonants are removed and, thereafter, the vowel permutation is considered to be a feature.</p><p>The text representation by means of the above mentioned features is described in Section 2.3.</p><p>1 POS-tagger of NLTK</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Graph-based approach</head><p>In this approach, a graph based representation is considered <ref type="bibr" target="#b4">[5]</ref>. Given a graph G = (V, E, L, f ) with V being the non-empty set of vertices, E ⊆ V × V the edges, L the tag set, and f : E → L, a function that assigns a tag to a pair of associated vertices. Each text paragraph is tagged with its corresponding PoS tags, in this case, using the TreeTagger tool 2 . Each word is stemmed using the Porter stemmer <ref type="foot" target="#foot_0">3</ref> . In this type of text representation, each vertex is considered to be a stemmed word and each edge is considered to be its corresponding PoS tag. The word sequence of the paragraphs to be represented is kept. The tag set of PoS used in the experiments is shown in Table <ref type="table" target="#tab_1">1</ref>. In order to demonstrate the way we construct the graph for each phrase, consider the following text phrase: "second qualifier long road leading 1998 world cup". The associated graph representation is shown in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>The Subdue tool Once each paragraph is represented by means of a graph, we apply a data mining algorithm in order to find subgraphs. Subdue is a data mining tool widely used in structured domains. This tool has been used for discovering structured patterns in texts represented by means of graphs <ref type="bibr" target="#b1">[2]</ref>. Subdue uses an evaluation model named "Minimum encoding", a technique derived from the minimum description length principle <ref type="bibr" target="#b2">[3]</ref>, in which the best graph sub-structures are chosen. The best subgraphs are those that minimize the number of bits that represent the graph. In this case, the number of bits is calculated considering the size of the graph adjancency matrix. Thus, the best substructure is the one that minimizes I(S) + I(G|S), where I(S) is the number of bits required to describe the substructure S, and I(G|S) is the number of bits required to describe graph G after it has been compacted by the substructure S.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Text representation schema</head><p>Let (x 1 , x 2 , x 3 , • • • , x n ) be the set of features selected for representing the documents. Each document D is represented considering the feature frequency, i.e., we used the bag of words representation for each document <ref type="bibr" target="#b0">[1]</ref>. It is worth noting that both approaches (lexical-syntactic and graph-based) use the same text representation schema.</p><p>The training stage uses the following feature vector:</p><formula xml:id="formula_0">D = (x 1 , x 2 , x 3 , . . . , x n Document f eatures , C)<label>(1)</label></formula><p>where C is the class manually associated to the document, in this case, the author Name or ID.</p><p>For the testing stage, we use the feature vector as follows:</p><formula xml:id="formula_1">D = (x 1 , x 2 , x 3 , . . . , x n Document f eatures )<label>(2)</label></formula><p>In this case, there is not a classification attribute (class name) due to the anonymous source of the document.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Description of the classifiers used in the task</head><p>We have used a Support Vector Machine (SVM) classifier for the problems A, B, C, D, I and J (see Section 4.1). SVM is a learning method based on the use of a hypothesis space of lineal functions in a higher dimensional space induced by a kernel, in which the hypotheses are trained by one algorithm that uses elements of the generalization theory and taken from the optimization theory.</p><p>The linear learning machines are barely used in major real world applications due to their computational limitations. Kernel based representations are an alternative for this problem proyecting the information to a feature space of higher dimensionality which increases the computational capacity of the linear learning machines. The input space X is mapped to a new feature space as follows:</p><formula xml:id="formula_2">x = {x 1 , x 2 , . . . , x n } → φ(x) = {φ(x) 1 , φ(x) 2 , . . . , φ(x) n }<label>(3)</label></formula><p>By employing the kernel function, it is not necessary to explicitly calculate the mapping φ : X → F in order to learn in the feature space.</p><p>In this research work, we employed as kernel the polynomial mapping, which is a very popular method for modeling non-linear functions:</p><formula xml:id="formula_3">K(x, x) = ( x, x + c) d<label>(4)</label></formula><p>where c ∈ R.</p><p>For problems E and F, we have employed the K-means clustering method, representing the documents with the six lexical-syntactic features previously presented. Kmeans is a cluster analysis method that aims to partition n observations into K clusters, considering that each observation belongs to the cluster with the closest median.</p><p>In the experiments carried out in this paper, we used the Weka data mining platform <ref type="bibr" target="#b3">[4]</ref> for executing the implementations of SVM and the K-means classifier.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental results</head><p>The results obtained with both approaches are discussed in this section. First, we describe the dataset used in the experiments and, thereafter, the obtained results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data sets</head><p>The description of the eight text collections used in the experiments (six for the traditional sub-task and two for the clustering sub-task) is shown in Table <ref type="table" target="#tab_2">2</ref>. As can be seen, the data set is made up of different authors. Actually, the first and second text collections (A and B) contain three different authors, the third and fourth collections (C and D) contain eight different authors, the fifth and sixth collections (I and J) contain 14 different authors, and finally, the seventh and eighth collections (E and F) contain mixed and intrusive paragraphs from 1 to 4 different authors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results obtained in the traditional sub-task</head><p>In Table <ref type="table" target="#tab_3">3</ref> are shown the results obtained for the problems A, B, C, D, I and J of the traditional sub-task.</p><p>In order to tackle the open-class problems (B, D, and J), the training data set was enriched with documents written by unknown authors. The number of documents added  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Results obtained in the clustering sub-task</head><p>Table <ref type="table" target="#tab_4">4</ref> shows the results obtained in the problems E and F of the clustering sub-task. Different runs varying the number of clusters (K) were sent to the competition, however, we are not aware of the final run (the number K) reported in this paper, which was evaluated by the competition organizers. The different runs were motivated by an empirical value selected by means of the cosine similarity among different paragraphs of the text collection. This metric was a clue for determining the final number of clusters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>In this paper we presented an exploration of two different approaches. On the first hand, we employed a graph for representing text paragraphs by means of words and their corresponding part of speech tags. We aimed to consider the morphosyntactical structure of the text (at once) for further discovering of the best features for the final representation of the training and test data. On the other hand, we evaluated a set of six lexical-syntactic features with the purpose of determining those that allow finding an appropriate signature for a given author. A higher number of features (phrase and word level) were independently evaluated, and those that provided the best discrimination scores were selected for the final evaluation.</p><p>In general, we observed that the graph-based representation obtained a better performance than the other one. However, more investigation on the graph representation is still required, so that graph patterns discovered by the Subdue tool are better than the ones obtained until now. As future work, we want to experiment with different graphbased text representations that allow us to obtain much more complex patterns.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Graph based text representation with words and their corresponding PoS tags</figDesc><graphic coords="4,134.77,115.86,366.59,133.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Description of PoS tags used</figDesc><table><row><cell cols="2">PoS tag Description</cell></row><row><cell>JJ</cell><cell>Adjective</cell></row><row><cell>VBN</cell><cell>Verb -Past participle</cell></row><row><cell cols="2">WDT Determiner</cell></row><row><cell>NN</cell><cell>Noun</cell></row><row><cell>CD</cell><cell>Number</cell></row><row><cell>RB</cell><cell>Adverb</cell></row><row><cell>NNS</cell><cell>Noun -Singular</cell></row><row><cell>CC</cell><cell>Conjuntion</cell></row><row><cell>RBR</cell><cell>Adverb -Comparative</cell></row><row><cell>MD</cell><cell>Modal</cell></row><row><cell>JJR</cell><cell>Adjective -Comparative</cell></row><row><cell>VBG</cell><cell>Verb -Present participle</cell></row><row><cell>VBD</cell><cell>Verb -Past</cell></row><row><cell>VBP</cell><cell>Verb -Present, not the 3rd person singular</cell></row><row><cell>VBZ</cell><cell>Verb -Present, 3rd person singular</cell></row><row><cell>FW</cell><cell>Unknown word</cell></row><row><cell>PRP</cell><cell>Possessive pronoun</cell></row><row><cell>VB</cell><cell>Verb in base form</cell></row><row><cell>NNP</cell><cell>Noun -Plural</cell></row><row><cell>RBS</cell><cell>Adverb -Superlative</cell></row><row><cell>IN</cell><cell>Preposition and conjunction</cell></row><row><cell>JJS</cell><cell>Adjective superlative</cell></row><row><cell>PDT</cell><cell>Predeterminer</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Data set used in the experiments</figDesc><table><row><cell>Task</cell><cell>Problem</cell><cell>type</cell><cell cols="3">Authors Documents training Documents test</cell></row><row><cell>Traditional</cell><cell>A</cell><cell>closed-class</cell><cell>3</cell><cell>6</cell><cell>6</cell></row><row><cell>Traditional</cell><cell>B</cell><cell>open-class</cell><cell>3</cell><cell>6</cell><cell>10</cell></row><row><cell>Traditional</cell><cell>C</cell><cell>closed-class</cell><cell>8</cell><cell>16</cell><cell>8</cell></row><row><cell>Traditional</cell><cell>D</cell><cell>open-class</cell><cell>8</cell><cell>16</cell><cell>17</cell></row><row><cell>Traditional</cell><cell>I</cell><cell>closed-class</cell><cell>14</cell><cell>28</cell><cell>14</cell></row><row><cell>Traditional</cell><cell>J</cell><cell>open-class</cell><cell>14</cell><cell>28</cell><cell>16</cell></row><row><cell>clustering</cell><cell>E</cell><cell>mixed paragraph documents</cell><cell>1-4</cell><cell cols="2">2(6 paragraphs) 3 (90 paragraphs)</cell></row><row><cell>clustering</cell><cell>F</cell><cell cols="2">intrusive paragraph documents 1-4</cell><cell cols="2">2(17 paragraphs) 4(80 paragraphs)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Results obtained in the traditional sub-task same that the number of documents of each original collection. As may be seen in the obtained results, the best results were obtained with the graph-based representation, in which the best features were discovered with the Subdue tool. The closed-class problems obtained a better performance than the open-class ones which encourages us to investigate better methods for tackling this particular issue. It is worth noting that we retrieved almost all the authors for the open-class problem J. We consider that the number of training data is an important factor on this behavior, in other words, we consider increasing the amount of information provided by the authors.</figDesc><table><row><cell>Task</cell><cell cols="6">A correct/A% B correct/B% C correct/C% D correct/D% I correct/I% J correct/J%</cell></row><row><cell>Graph-based approach</cell><cell>5/83.333</cell><cell>6/60</cell><cell>5/62.5</cell><cell>4/23.529</cell><cell>8/57.142</cell><cell>13/81.25</cell></row><row><cell>Lexical-syntactic approach</cell><cell>4/66.666</cell><cell>3/30</cell><cell>2/25</cell><cell>6/35.294</cell><cell>10/71.428</cell><cell>7/43.75</cell></row><row><cell>was exactly the</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 .</head><label>4</label><figDesc>Results obtained in the clustering sub-task</figDesc><table><row><cell>Task</cell><cell cols="2">E correct/E% F correct/F%</cell></row><row><cell>Graph-based approach</cell><cell>68/75.555</cell><cell>43/53.75</cell></row><row><cell cols="2">Lexical-Syntactic approach 61/67.777</cell><cell>51/63.75</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://tartarus.org/ martin/PorterStemmer/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Introduction to Information Retrieval</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schtze</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>New York, NY, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Subgraph isomorphism detection using a code based representation</title>
		<author>
			<persName><forename type="first">I</forename><surname>Olmos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osorio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">FLAIRS Conference</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="474" to="479" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Stochastic Complexity in Statistical Inquiry Theory</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rissanen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1989">1989</date>
			<publisher>World Scientific Publishing Co., Inc</publisher>
			<pubPlace>River Edge, NJ, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Data Mining: Practical Machine Learning Tools and Techniques</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frank</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Morgan Kaufmann Series in Data Management Sys</title>
				<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2005-06">June 2005</date>
		</imprint>
	</monogr>
	<note>second edn</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">gspan: Graph-based substructure pattern mining</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDM</title>
				<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="721" to="724" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
