<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Empirical Evaluation of Author Identification Hybrid Method: Notebook for PAN at CLEF 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seifeddine Mechti</string-name>
          <email>mechtiseif@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maher Jaoua</string-name>
          <email>maher.jaoua@fsegs.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rim Faiz</string-name>
          <email>Rim.faiz@ihec.rnu.tn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lamia Hadrich Belguith</string-name>
          <email>l.belguith@fsegs.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bassem Bsir</string-name>
          <email>Bassem.bsir@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ANLP Group, MIRACL Laboratory, University of Sfax</institution>
          ,
          <addr-line>3018, Sfax, Tunisia</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IHEC of Carthage</institution>
          ,
          <addr-line>2016 Carthage Présidence</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LARODEC Laboratory, ISG of Tunis</institution>
          ,
          <addr-line>B.P. 1088, 2000 Le Bardo</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <fpage>6</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>In this paper we focus on the identification of the author of a written text. We present a new hybrid method that combines a set of stylistic and statistical features in a machine learning process. We tested the effectiveness of the linguistic and statistical features combined with the inter-textual distance "Delta" on the PAN @CLEF 2015 English corpus and obtained a c@1 score of 0.59.</p>
      </abstract>
      <kwd-group>
        <kwd>Author Identification</kwd>
        <kwd>machine learning</kwd>
        <kwd>sub corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The task, which consists in deciding by automatic means whether a text T was
written by an author A, fits into the research field that focuses on author
identification. The benefit of automating this task is substantial because of its
usefulness in various fields such as forensic analysis, forensic linguistics, electronic
commerce and plagiarism detection. In the latter case, the probability that a text
contains plagiarism increases when two parts of this text are not
attributed to the same author.</p>
      <p>In the literature, the automation of the author attribution task can draw on stylistic
or statistical attributes. Currently, learning techniques are being used to infer
attributes that discriminate the styles of authors. It is in this context that we propose in
this paper a hybrid method that combines stylistic and statistical attributes while
relying on measurements of inter-textual distances. We will present the results of our
experiments, using several learning techniques.</p>
      <p>To explain the steps of our method, its implementation, and findings, we organized
this paper into five sections including this introduction. The second section is devoted
to an overview of the main work in this field. The third section details the proposed
method and specifies the stylistic and statistical features used. The fourth section
presents the experiments and evaluations conducted on the corpus disseminated
during the PAN @CLEF'2015 conference. The conclusion reiterates the main
findings and presents some prospects that are still in the experimentation stage.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview of Automatic Author Identification Methods</title>
      <p>An overview of the main works in the field of author identification allows us to
identify three types of approaches [1]. The first is based on the stylistic analysis of
documents and aims to identify style invariants that help us distinguish the writings of
an author from those of another author. The second approach is based on multivariate
statistical analyses and aims to identify the joint distribution of some style variables
making it possible to decide whether two texts show a significant correlation of style.
The third approach, described as recent, is based on machine learning algorithms and
seeks to build classifiers that infer the lexical and syntactic attributes that characterize
the style of an author.</p>
      <p>The basic idea of the stylistic methods is structured around the modeling of the
authors from a linguistic point of view so that we can compare their writings. We cite
as an example the work of Li et al., who focused on topographical signs
[2], and the work of Zheng et al., who were interested in the co-occurrence of
character n-grams [3]. Other works were concerned with the distribution of function
words [4] or the complexity of vocabulary [5]. In another work, Raghavan et al.
capitalized on the probabilistic context-free grammars to model the grammar used by
an author [6]. Feng et al. based their research on the syntactic functions of words and
their inter-relationships in order to discern the complex constructions used by each
author [7]. Other studies focused on the semantic dependency between the words of
written texts through the use of taxonomies [8]. Finally, and in a critical study,
Baayen demonstrated that stylistic methods show weak performance in the analysis of
short texts [9]. Moreover, he demonstrated that style can change over time or
according to the literary genre of texts (poetry, novels, plays ...).</p>
      <p>The first statistical attempts compared the occurrence frequency of a number of
function words (determiners, prepositions, conjunctions, and pronouns) [10].
However, the evaluation of this method revealed its limitations, and it is for
this reason that other studies experimented with multivariate statistical indices. We
cite as an example principal component analysis [11]. Other methods use
probabilistic measurements of distance such as the inter-textual distance [12], the
LDA distribution [13], the KL divergence between hidden Markov
models [14] and the χ² distance [15].</p>
      <p>Burrows [16] puts forward a statistical rule called the "Delta rule", which is based on the set of
the most frequent terms (between 40 and 150), especially function words. It is
noteworthy that this rule has been used by numerous studies in the field of author
identification [17,18]. For his part, Savoy puts forward a probabilistic model for the
attribution of documents addressing several topics [19]. In this framework, each
document of a given corpus is modeled as a distribution of different themes, each
theme representing a specific distribution of words.</p>
      <p>The use of machine learning techniques stems from the observation that the task of
author identification can be seen as a classification problem [1]. The methods that
follow this approach hinge on two stages. The first consists in representing the
source texts as labeled multivariate word vectors. The second consists in
using learning techniques to identify the boundaries of each class while
minimizing a classification loss function. To construct the classification model,
several techniques have been adopted, such as discriminant analysis [19], SVM
[2], decision trees [20], neural networks [3], ensembles of classifiers
[1] and topic models [18]. It should be noted that other studies have compared
the performance of several classifiers on the author identification task [20].</p>
    </sec>
    <sec id="sec-3">
      <title>3 Proposed Method</title>
      <p>Hybridization has always been considered an interesting track because it
overcomes the limitations of the individual approaches being combined. With this objective in mind,
we experimented with learning techniques on all the stylistic and statistical
features that have shown their efficiency in the literature. The basic idea is to create,
for each text T whose attribution to an author A we want to verify, a sub corpus
which includes all the texts written by this author and the texts that are close to T in
terms of distance. Thus, if the text was written by author A, then there is a high
probability that we recognize the style via the stylistic and statistical features of
author A's texts belonging to the corpus. On the other hand, if A is not the writer of T,
then there is a good chance that it is assigned to another author selected from the rest
of the sub corpus.</p>
      <p>[Figure: overview of the proposed method. Box labels: Initial corpus CLEF 2015; Determining the statistical and stylistic features; Learning matrix; Extraction of the sub corpus; Sub corpus CLEF 2015; New text; Machine Learning; Decision.]</p>
      <sec id="sec-3-2">
        <sec id="sec-3-2-2">
          <p>In order to implement the proposed method, we developed a system called
HyTAI (Hybrid Tool for Author Identification) whose modular decomposition
follows the proposed method. We used the Delta rule in the sub corpus extraction module
to calculate the distance between two texts, and
Apache OpenNLP for the extraction of the stylistic and statistical features.</p>
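          <p>
            To calculate the distance between two documents, we used the Delta distance
proposed by Burrows
            <xref ref-type="bibr" rid="ref15">(Burrows 2002)</xref>
            . This distance, which takes into account
the m most frequent words, is characterized by the following formula:
            <disp-formula><tex-math><![CDATA[\Delta(Q, A_j) = \frac{1}{m} \sum_{i=1}^{m} \left| \mathrm{Zscore}(t_{iq}) - \mathrm{Zscore}(t_{ij}) \right|]]></tex-math></disp-formula>
            where
            <disp-formula><tex-math><![CDATA[\mathrm{Zscore}(t_{ij}) = \frac{\mathit{tfr}_{ij} - \mathit{mean}_i}{\mathit{sd}_i}]]></tex-math></disp-formula>
          </p>
          <p>Note that tfr<sub>ij</sub> is the frequency of the term t<sub>i</sub> in the document A<sub>j</sub>, while mean<sub>i</sub> and
sd<sub>i</sub> are the mean and the standard deviation of the frequency of t<sub>i</sub>.</p>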
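          <p>The Delta computation can be illustrated with a minimal Python sketch. It is not the HyTAI implementation: the function names, the pre-tokenized input, and the choice to estimate the mean and standard deviation of each term over the texts of a reference corpus are assumptions made for illustration.</p>
          <preformat><![CDATA[
```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary term in one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty texts
    return [counts[term] / total for term in vocab]


def delta(query, candidate, corpus, m=150):
    """Mean absolute z-score difference over the m most frequent corpus terms."""
    pooled = Counter(tok for doc in corpus for tok in doc)
    vocab = [term for term, _ in pooled.most_common(m)]
    if not vocab:
        return 0.0
    # Assumption: per-term mean and standard deviation come from the corpus texts.
    columns = list(zip(*(relative_freqs(doc, vocab) for doc in corpus)))
    q = relative_freqs(query, vocab)
    a = relative_freqs(candidate, vocab)
    total = 0.0
    for q_i, a_i, col in zip(q, a, columns):
        sd = pstdev(col)
        if sd == 0:
            continue  # z-score undefined; the term contributes nothing
        mu = mean(col)
        total += abs((q_i - mu) / sd - (a_i - mu) / sd)
    return total / len(vocab)
```
]]></preformat>
          <p>In this sketch, two identical texts yield a Delta of 0, and the value grows as the usage of the most frequent words diverges.</p>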
          <p>It should be noted that if two texts are quite close, then Delta tends toward 0.
The value of m may vary from one corpus to another, which is why we
conducted an experiment to determine it (see next section). For the
training sub corpus, we choose the texts nearest to the document to be verified in such a
way that a balance is achieved between the texts written by the author to be identified
and the texts that do not belong to that author.</p>
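          <p>The balanced selection described above might be sketched as follows. The function name, the label convention (1 for the candidate author, 0 for other authors), and passing the distance as a callable are illustrative assumptions, not details taken from HyTAI.</p>
          <preformat><![CDATA[
```python
def build_sub_corpus(unknown, known_by_author, other_texts, distance):
    """Return labeled training texts: the author's known texts plus an equal
    number of texts by other authors that are nearest to the unknown text."""
    # Rank the other authors' texts by their distance to the unknown text.
    ranked = sorted(other_texts, key=lambda text: distance(unknown, text))
    negatives = ranked[:len(known_by_author)]
    positives = [(text, 1) for text in known_by_author]
    return positives + [(text, 0) for text in negatives]
```
]]></preformat>
          <p>Any distance function can be plugged in; in the setting of this paper it would be the Delta distance.</p>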
          <p>In order to extract the stylistic and statistical features, we used tools from the
Apache OpenNLP library, which contains a set of functions that can segment texts
and perform syntactic and lexical analyses. We calculated the frequency of lexical
features, the ratio V / N, where V is the number of hapax legomena and N is the text length, and
the average length of sentences. Regarding parsing, also conducted with
OpenNLP, we extract the number of nouns, the number of verbs, the number of
adjectives, the number of adverbs, and the number of prepositions.</p>
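          <p>As a rough illustration of the lexical features, the sketch below computes the ratio V / N and the average sentence length. It assumes a simple regular-expression tokenizer and sentence splitter in place of OpenNLP, so the values it produces would differ from those of the actual system; part-of-speech counts are omitted because they require a tagger.</p>
          <preformat><![CDATA[
```python
import re
from collections import Counter


def lexical_features(text):
    """Hapax ratio V / N and average sentence length (in words) of a text."""
    # Assumption: sentences end at '.', '!' or '?'; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)  # words seen exactly once
    n = len(tokens)
    return {
        "hapax_ratio": hapax / n if n else 0.0,
        "avg_sentence_length": n / len(sentences) if sentences else 0.0,
    }
```
]]></preformat>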
          <p>Then to extract the features related to the model of the language, we consider the text
as a simple sequence of characters and determine the frequencies of the letters, the
punctuation marks and the numeric characters as well as n-grams.</p>
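          <p>The character-level features can be sketched in the same spirit. The function name, the n-gram order n = 3, and normalization by text length are assumptions for illustration only.</p>
          <preformat><![CDATA[
```python
import string
from collections import Counter


def char_features(text, n=3):
    """Relative frequencies of letters, digits and punctuation, plus n-gram counts."""
    total = len(text) or 1  # avoid division by zero on empty texts
    letters = Counter(ch.lower() for ch in text if ch.isalpha())
    punctuation = Counter(ch for ch in text if ch in string.punctuation)
    digits = sum(ch.isdigit() for ch in text)
    # Overlapping character n-grams over the raw character sequence.
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {
        "letter_freq": {ch: c / total for ch, c in letters.items()},
        "punctuation_freq": {ch: c / total for ch, c in punctuation.items()},
        "digit_ratio": digits / total,
        "ngrams": ngrams,
    }
```
]]></preformat>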
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>To evaluate the HyTAI system, we conducted a series of experiments that aim to
determine the ideal parameters, such as the threshold of the Delta distance, as well as
the most suitable learning algorithm. We used the corpus disseminated at the
PAN @CLEF 2015 conference. This corpus consists of 200 collections of English
documents (essays) which include 518 known texts and 200 unknown texts. The
average length of the documents is around 833 words.</p>
      <p>To evaluate the performance of our HyTAI system, we used the c@1 measure
adopted by the PAN @CLEF'2015 conference and defined by Peñas and Rodrigo (2011).
Compared to conventional measures of precision, this measure has
the advantage of taking into account the indecisions of the system, that is to say the cases where
the system cannot decide on the authorship of the document concerned. The formula
proposed for the calculation of c@1 is as follows:
        <disp-formula><tex-math><![CDATA[c@1 = \frac{1}{n}\left(n_c + \frac{n_c}{n}\, n_u\right)]]></tex-math></disp-formula>
where n is the total number of problems, n<sub>c</sub> is the number of correct decisions, and
n<sub>u</sub> is the number of cases of indecision.</p>
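      <p>For illustration, the measure can be computed with a few lines of Python. The sketch assumes the definition of Peñas and Rodrigo, c@1 = (n_c + n_u * n_c / n) / n; the function and parameter names are hypothetical.</p>
      <preformat><![CDATA[
```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1: unanswered problems are credited at the accuracy rate of the system."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total
```
]]></preformat>
      <p>With no unanswered problems the measure reduces to plain accuracy, whereas leaving a problem undecided is rewarded more than answering it incorrectly.</p>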
      <p>The following figure shows the c@1 results obtained with six
classifiers on the corpus. For an unknown text whose authorship we want to identify,
we create a sub corpus containing the known texts by the author and the same number of
texts that do not belong to the author and that are closest to the unknown text in terms of Delta
distance. The classifiers used in this experiment are: SVM, Bayesian Networks,
Naive Bayes, Decision tables, Decision tree and KNN.</p>
      <p>In our case, indecision arises when the classifier (in the
cases of SVM, Bayes and KNN) returns a median value close to 0.5. For these classifiers,
a problem is left undecided when the value obtained lies in the range [0.4, 0.6].</p>
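      <p>This decision rule can be sketched as follows, assuming the classifier returns a score between 0 and 1 for the hypothesis that the candidate author wrote the text. The function name and the use of None for an undecided case are illustrative choices.</p>
      <preformat><![CDATA[
```python
def decide(score, low=0.4, high=0.6):
    """Map a classifier score to True (same author), False, or None (undecided)."""
    if low <= score <= high:
        return None  # too close to 0.5: leave the problem unanswered
    return score > high
```
]]></preformat>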
      <p>According to the histogram, the best results (on the c@1 axis) are obtained
with the SVM algorithm, followed by the naive Bayes classifier.</p>
      <p>To determine the optimal threshold used by the Delta distance, we conducted an
experiment in which we measured c@1 while varying the threshold from 50 to 400.
In this experiment, we used the SVM algorithm as the classifier.</p>
      <p>As shown in the previous figure, with a size of 60% of the English corpus, the
HyTAI system obtained a c@1 score of 0.59.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we presented an automatic author identification method which is
based on the combination of statistical and stylistic features while relying on the SVM
learning algorithm. The results obtained on the corpus of the PAN @CLEF'2015
conference demonstrate the value of hybridization and the importance of statistical
features. However, these results do not satisfy our ambitions; that is why we are
planning, during our next participation in the PAN conference, to change the training
corpus by choosing different textual forms derived from the text. This procedure is
intended to "fill" the training corpus and thus to reach more accurate decisions. Our
first experiments in this direction are very promising.
We also plan to extend our method to take into account the other languages put
forward in the author identification task. Within this framework, we will focus more
on the statistical features and those derived from the language model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          ,
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Li</surname> <given-names>J.</given-names></string-name>,
          <string-name><surname>Zheng</surname> <given-names>R.</given-names></string-name>,
          <string-name><surname>Chen</surname> <given-names>H.</given-names></string-name>
          <article-title>From fingerprint to writeprint</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>49</volume>
          (
          <issue>4</issue>
          ),
          <fpage>76</fpage>
          -
          <lpage>82</lpage>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Zheng</surname> <given-names>R.</given-names></string-name>,
          <string-name><surname>Li</surname> <given-names>J.</given-names></string-name>,
          <string-name><surname>Chen</surname> <given-names>H.</given-names></string-name>,
          <string-name><surname>Huang</surname> <given-names>Z.</given-names></string-name>
          <article-title>A framework for authorship identification of online messages: Writing-style features and classification techniques</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>57</volume>
          (
          <issue>3</issue>
          ),
          <fpage>378</fpage>
          -
          <lpage>393</lpage>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Vartapetiance</surname> <given-names>A.</given-names></string-name>,
          <string-name><surname>Gillam</surname> <given-names>L.</given-names></string-name>
          <article-title>A Trinity of Trials: Surrey's 2014 Attempts at Author Verification</article-title>
          .
          <source>Proceedings of PAN@CLEF'2014</source>
          .
          <year>2014</year>
          .
          <string-name><surname>Argamon</surname> <given-names>S.</given-names></string-name>,
          <string-name><surname>Whitelaw</surname> <given-names>C.</given-names></string-name>,
          <string-name><surname>Chase</surname> <given-names>P.</given-names></string-name>,
          <string-name><surname>Hota</surname> <given-names>S.R.</given-names></string-name>,
          <string-name><surname>Garg</surname> <given-names>N.</given-names></string-name>,
          <string-name><surname>Levitan</surname> <given-names>S.</given-names></string-name>
          <article-title>Stylistic text classification using functional lexical features</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>58</volume>
          (
          <issue>6</issue>
          ),
          <fpage>802</fpage>
          -
          <lpage>822</lpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Raghavan</surname> <given-names>S.</given-names></string-name>,
          <string-name><surname>Kovashka</surname> <given-names>A.</given-names></string-name>,
          <string-name><surname>Mooney</surname> <given-names>R.</given-names></string-name>
          <article-title>Authorship attribution using probabilistic context-free grammars</article-title>
          .
          <source>Proceedings of ACL</source>
          ,
          <fpage>38</fpage>
          -
          <lpage>42</lpage>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Feng</surname> <given-names>V.W.</given-names></string-name>,
          <string-name><surname>Hirst</surname> <given-names>G.</given-names></string-name>
          <article-title>Authorship verification with entity coherence and other rich linguistic features</article-title>
          .
          <source>Proceedings of CLEF'13</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>McCarthy</surname> <given-names>P.M.</given-names></string-name>,
          <string-name><surname>Lewis</surname> <given-names>G.A.</given-names></string-name>,
          <string-name><surname>Dufty</surname> <given-names>D.F.</given-names></string-name>,
          <string-name><surname>McNamara</surname> <given-names>D.S.</given-names></string-name>
          <article-title>Analyzing writing styles with Coh-Metrix</article-title>
          .
          <source>Proceedings of FLAIRS'06</source>
          ,
          <fpage>764</fpage>
          -
          <lpage>769</lpage>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Baayen</surname> <given-names>R.H.</given-names></string-name>
          <source>Analyzing Linguistic Data: A Practical Introduction to Statistics using R</source>
          . Cambridge: Cambridge University Press.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Mosteller</surname> <given-names>F.</given-names></string-name>,
          <string-name><surname>Wallace</surname> <given-names>D.L.</given-names></string-name>
          <article-title>Inference in an Authorship Problem</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>58</volume>
          ,
          <fpage>275</fpage>
          -
          <lpage>309</lpage>
          .
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Burrows</surname> <given-names>J.F.</given-names></string-name>
          <article-title>Not unless you ask nicely: The interpretative nexus between analysis and information</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <fpage>91</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Labbé</surname> <given-names>C.</given-names></string-name>
          <article-title>Inter-Textual Distance and Authorship Attribution: Corneille and Molière</article-title>
          .
          <source>Journal of Quantitative Linguistics</source>
          ,
          <fpage>213</fpage>
          -
          <lpage>231</lpage>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>Proceedings of the 21st international conference on Machine learning ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name><surname>Hershey</surname> <given-names>J.R.</given-names></string-name>,
          <string-name><surname>Thomas</surname> <given-names>J.W.</given-names></string-name>,
          <string-name><surname>Olsen</surname> <given-names>P.A.</given-names></string-name>,
          <string-name><surname>Rennie</surname> <given-names>S.J.</given-names></string-name>
          <article-title>Variational Kullback-Leibler divergence for Hidden Markov models</article-title>
          .
          <source>Proceedings of IEEE Workshop on Automatic Speech Recognition Understanding</source>
          ,
          <fpage>323</fpage>
          -
          <lpage>328</lpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name><surname>Grieve</surname> <given-names>J.</given-names></string-name>
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          )
          ,
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name><surname>Burrows</surname> <given-names>J.F.</given-names></string-name>
          <article-title>Delta: a Measure of Stylistic Difference and a Guide to Likely Authorship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          .
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name><surname>Savoy</surname> <given-names>J.</given-names></string-name>
          <article-title>Attribution d'auteur par ensembles de séparateurs</article-title>
          .
          <source>Actes de la COnférence en Recherche d'Information et Applications CORIA</source>
          ,
          <fpage>277</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name><surname>Savoy</surname> <given-names>J.</given-names></string-name>
          <article-title>Etude comparative de stratégies de sélection de prédicteurs pour l'attribution d'auteur</article-title>
          .
          <source>Actes de la COnférence en Recherche d'Information et Applications CORIA</source>
          ,
          <fpage>215</fpage>
          -
          <lpage>228</lpage>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name><surname>Stamatatos</surname> <given-names>E.</given-names></string-name>,
          <string-name><surname>Fakotakis</surname> <given-names>N.</given-names></string-name>,
          <string-name><surname>Kokkinakis</surname> <given-names>G.</given-names></string-name>
          <article-title>Automatic text categorization in terms of genre and author</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>26</volume>
          ,
          <fpage>471</fpage>
          -
          <lpage>495</lpage>
          .
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name><surname>Zhao</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Zobel</surname> <given-names>J.</given-names></string-name>
          <article-title>Searching with style: Authorship attribution in classic literature</article-title>
          .
          <source>Proceedings of the Australian Computer Science Conference</source>
          ,
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>