<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Method Exploring a Large Set of Features for Authorship Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seifeddine Mechti</string-name>
          <email>mechtiseif@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rim Faiz</string-name>
          <email>Rim.faiz@ihec.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maher Jaoua</string-name>
          <email>maher.jaoua@fsegs.rnu.tn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lamia Hadrich Belguith</string-name>
          <email>l.belguith@fsegs.rnu.tn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LARODEC Laboratory</institution>
          ,
          <addr-line>IHEC Carthage</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LARODEC Laboratory, ISG of Tunis B.P.</institution>
          <addr-line>1088, 2000 Le Bardo</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MIRACL Laboratory</institution>
          ,
          <addr-line>FSEGS, BP 1088, 3018 Sfax</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <fpage>89</fpage>
      <lpage>95</lpage>
      <abstract>
        <p>In this paper, we address the problem of identifying the author of a document whose origin is unknown. To this end, we propose a new hybrid approach combining statistical and stylistic analysis. Our method determines the lexical and syntactic features of the written text in order to identify the author of the document. These features are then used to build a machine learning process. We obtained promising results on the PAN@CLEF2014 English literature corpus. The experimental results are comparable to those obtained by the best state-of-the-art methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Recently, authorship analysis has attracted growing
interest because of its applications
in many domains, such as e-commerce and forensic
linguistics. In the latter, for instance, author
identification can make many investigations
easier. Additionally, the author identification task
is very useful in the plagiarism detection
process. Indeed, the probability of plagiarism
increases when two parts of a document are not
attributed to the same author. This task is planned
in PAN@CLEF 2016. In addition, forensic
analysis, that is, establishing document paternity for legal
purposes, can contribute to several investigations
focusing on various linguistic characteristics. In
the literature, the automation of the author
identification task can draw on stylistic or statistical
attributes. Currently, machine learning techniques
are used to infer attributes that discriminate between
authors' styles. In this context, we propose a
hybrid method combining stylistic and
statistical attributes while relying on measurements of
intertextual distances. In this paper, we present
the results of our experiments, using several
learning techniques. The objective of the work
proposed in
        <xref ref-type="bibr" rid="ref1">(Stamatatos et al., 2014)</xref>
        is to determine
from a specific list the author who wrote a given
text. Thus, for this identification, we should focus
on open-set or closed-set classification problems.
In this context, we address a non-factoid question:
was a particular text written by a well-defined
author? This paper is organized as follows. In Section
2, we review the author identification approaches
proposed in the literature. After that, we present our
hybrid method based on statistical and stylistic
analysis. In Section 3, we describe the machine
learning process. The fourth section presents the
experiments carried out together with the several
tests and algorithms applied. Then, we compare
our results with those obtained by
other methods. Finally, we conclude the paper
with some remarks and propose
directions for future research.
      </p>
      <p>2 Related Work</p>
      <p>
        In this section, we review author
identification methods, which fall essentially into three
categories. The first one is based on stylistic
analysis. The second contains techniques
relying on various statistical analyses. The third
category, which includes more recent methods, uses
machine learning algorithms. The basic idea of
the stylistic methods is to model authors
from a linguistic point of view. For instance, we
can mention the work of Li et al. (2006), who
focused on topographic signs
        <xref ref-type="bibr" rid="ref2 ref4">(Li et al., 2006)</xref>
        ,
as well as the studies of Zheng et al., who were interested
in the co-occurrence of character n-grams
        <xref ref-type="bibr" rid="ref2 ref4">(Zheng et al., 2006)</xref>
        . Other researchers were concerned
with the distribution of function words
(Vartapetiance et al., 2014) or lexical features
(Argamon et al., 2007). In another work, Raghavan et
al. exploited probabilistic context-free grammars
to model the grammar used by
an author (Raghavan et al., 2010). Feng et al.
dealt with the syntactic functions of words and
their relationships in order to discern entity
coherence
        <xref ref-type="bibr" rid="ref8">(Feng et al., 2013)</xref>
        . Other studies examined the
semantic dependency between the words of
written texts by means of taxonomies and thesauri
(McCarthy et al., 2006).
      </p>
      <p>
        Concerning statistical
methods, the first attempts emerged in
        <xref ref-type="bibr" rid="ref11">(Mosteller and Wallace, 1964)</xref>
        . They compared the occurrence
frequency of words, such as verbs, nouns,
articles, prepositions, conjunctions, and pronouns. In
the last few years, new methods, based on
various statistical tools, have been introduced in order
to discriminate between the potential authors of a
text. Among these methods, we can mention
intertextual distance (Labbé, 2014), the Delta method
(Savoy, 2013), the LDA distribution
        <xref ref-type="bibr" rid="ref15">(Blei et al., 2004)</xref>
        and the KL divergence distance (Hershey
et al., 2007). Indeed, Labbé (2003)
demonstrated the effectiveness of intertextual distance in
quantifying the proximity between several texts
through a normalized index. Later, he revealed
the considerable contribution of Corneille to plays
written by Molière. As reported in (Savoy, 2014), Burrows
proposed the Delta method in order to identify
the author of an unknown document. He
suggested selecting the 40 to 150 most frequently used
words, especially function words, while
ignoring punctuation marks. On the other hand,
in
        <xref ref-type="bibr" rid="ref17">(Grieve, 2007)</xref>
        , researchers demonstrated that
the Delta method could offer the best results. To
determine document paternity, the authors
introduced a probabilistic model for author
identification by addressing several topics
        <xref ref-type="bibr" rid="ref18">(Savoy, 2012)</xref>
        .
At this level, each corpus is modeled as a
distribution of different themes; each theme represents
a specific distribution of words.
      </p>
      <p>
        From a machine
learning point of view
        <xref ref-type="bibr" rid="ref1">(Stamatatos et al., 2014)</xref>
        ,
author verification methods can be either intrinsic or
extrinsic. Intrinsic methods use only the
known and unknown texts of the problem, while
extrinsic methods also use external documents by
other authors for each problem. The training
corpora are represented in different forms: each text
is considered as a vector in a space with several
variables. In addition, a variety of powerful
algorithms, including discriminant analysis
        <xref ref-type="bibr" rid="ref19">(Stamatatos et al., 2000)</xref>
        , SVM
        <xref ref-type="bibr" rid="ref20">(Lee et al., 2006)</xref>
        ,
decision trees (Zhao and Zobel, 2006), neural
networks (Argamon et al., 2007) and genetic
algorithms (Moreau et al., 2014), can be used to
construct a classification model.
      </p>
      <p>
        Finally, in a critical
study, Baayen showed that
stylistic methods perform poorly
on short texts
        <xref ref-type="bibr" rid="ref10">(Baayen et al., 2008)</xref>
        . He also showed
that style can change over time or according to the
literary genre of the texts (poetry, novels, plays,
etc.). Besides, despite its interesting results,
statistical analysis ignores the writer's style. In
this case, neither the vocabulary nor the theme of
the suspect document is taken into account.
Olson criticized some studies which convert
language into mathematical equations (Hershey et al.,
2007). We choose hybridization to take advantage
of both stylistic and statistical methods. On the
one hand, we use lexical and syntactic analysis
to address the problem of the mathematical
representation of a text (Section 3.1). On the other hand,
we apply the Delta rule to group the writers who
have almost the same style (Section 3.2).
      </p>
      <p>3 The Proposed Method</p>
      <p>This section describes our hybrid
extrinsic method for author identification. First, we
extract the different types of stylistic features
(syntactic, lexical and character-based) and then the
n-grams. In the second step, authors selection,
we rely on the Delta method. The third step
is reserved for the application of the learning
model.</p>
      <p>3.1 Feature Extraction</p>
      <p>
In order to extract features, also called style
markers, we use the tools of the Apache Open Library.
These robust tools make it possible to segment the texts and
to carry out the necessary syntactic and semantic analysis. For
the lexical features, obtained by frequency
calculations, the text is regarded as a set of tokens. We
consider the number of words that
appear only once (hapaxes), the ratio V/N (V is the number of
hapaxes, and N is the length of the text), the
average sentence length and the number of words
which appear twice. Then, we extract the
syntactic features, such as the number of nouns, verbs,
adjectives, adverbs and prepositions. For the character
features, we consider the text as a simple
sequence of characters. We take into account
the frequencies of
letters, punctuation marks (number of colons,
exclamation marks, question marks and commas),
uppercase and lowercase characters as well as
numerical and alphabetical characters. Finally, we
resort to character n-grams. We make n vary
from 3 up to 7 characters. A small n (n = 3)
captures
syllables and punctuation marks, while a larger n
captures whole words.</p>
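      <p>The lexical and character-level markers described above can be sketched as follows. This is only a minimal illustration in Python, not the authors' implementation: the regular-expression tokenizer and sentence splitter are simplifying assumptions that stand in for the Apache tools used in the paper, and the function names lexical_features and char_ngrams are hypothetical.</p>
      <preformat>
```python
import re
from collections import Counter


def lexical_features(text):
    """Compute a few of the lexical style markers described in the text."""
    # Naive tokenization (assumption): lowercase alphabetic words only.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    # Words appearing exactly once (hapaxes) and exactly twice.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    dis_legomena = sum(1 for c in counts.values() if c == 2)
    # Naive sentence splitting (assumption): break on ., ! and ?.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    return {
        "hapaxes": hapaxes,
        "dis_legomena": dis_legomena,
        # V/N: number of hapaxes divided by the text length in tokens.
        "hapax_ratio": hapaxes / n_tokens if n_tokens else 0.0,
        "avg_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
    }


def char_ngrams(text, n_min=3, n_max=7):
    """Count character n-grams for every n from n_min to n_max inclusive."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams
```
      </preformat>
      <p>In such a sketch, the dictionary returned by lexical_features and the n-gram counts would be concatenated into one feature vector per document before learning.</p>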
    </sec>
    <sec id="sec-2">
      <title>3.2 Authors Selection</title>
      <p>In this step, we select authors in order to prepare
the machine learning process. We apply the Delta
method to the candidate document and to all the existing
documents of the authors. For each document of unknown authorship,
we select the three authors who have the lowest
Delta measure with respect to the candidate document.
We note that different verification problems
(different folders) may share documents by the
same authors. For example, the known document
of folder EN001 and that of folder EN002 may be
written by the same author. Then, we calculate the
distance based on the standardized frequencies
(Z-scores) between two documents Q and A_j using
the following equation:
        <disp-formula>
          <tex-math>D(Q, A_j) = \frac{1}{m} \sum_{i=1}^{m} \left| \mathrm{Zscore}(t_{iq}) - \mathrm{Zscore}(t_{ij}) \right|</tex-math>
        </disp-formula>
      </p>
      <p>where</p>
      <p>
        <disp-formula>
          <tex-math>\mathrm{Zscore}(t_{ij}) = \frac{\mathrm{tfr}_{ij} - \mathrm{mean}_i}{\mathrm{sd}_i}</tex-math>
        </disp-formula>
Here, tfr_ij is the frequency of the term t_i in the
document D_j, mean_i represents the average frequency of that term,
sd_i denotes its standard deviation, and m is the number of terms
considered. Finally, we set
the number of most frequent terms between
100 and 400 words.</p>
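      <p>The selection step can be illustrated with the following minimal Python sketch of a Delta-style distance and of the choice of the three closest authors. It is not the authors' code: the names delta_distances and select_authors are hypothetical, documents are assumed to be already tokenized, and the mean and standard deviation of each term are assumed to be computed over the candidate document together with the authors' documents.</p>
      <preformat>
```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of every vocabulary term in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[t] / total if total else 0.0 for t in vocab]


def delta_distances(query_tokens, author_docs, m=250):
    """Return {author: Delta(query, author)} over the m most frequent terms."""
    # Vocabulary: the m most frequent terms over all documents (assumption).
    pooled = Counter(query_tokens)
    for tokens in author_docs.values():
        pooled.update(tokens)
    vocab = [term for term, _ in pooled.most_common(m)]
    if not vocab:
        return {name: 0.0 for name in author_docs}

    names = list(author_docs)
    rows = [relative_freqs(query_tokens, vocab)]
    rows += [relative_freqs(author_docs[name], vocab) for name in names]

    # Standardize each term frequency with its mean and standard deviation.
    z_rows = [[] for _ in rows]
    for i in range(len(vocab)):
        column = [row[i] for row in rows]
        mu, sd = mean(column), pstdev(column)
        for z_row, value in zip(z_rows, column):
            z_row.append((value - mu) / sd if sd else 0.0)

    # Delta: mean absolute difference between the Z-scores of two documents.
    query_z = z_rows[0]
    return {
        name: sum(abs(q - a) for q, a in zip(query_z, z)) / len(vocab)
        for name, z in zip(names, z_rows[1:])
    }


def select_authors(query_tokens, author_docs, k=3, m=250):
    """Keep the k authors with the lowest Delta to the candidate document."""
    distances = delta_distances(query_tokens, author_docs, m)
    return sorted(distances, key=distances.get)[:k]
```
      </preformat>
      <p>The default values k=3 and m=250 mirror the three selected authors and the 250 most frequent words reported in the text; both are ordinary parameters of the sketch.</p>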
    </sec>
    <sec id="sec-3">
      <title>3.3 Application of a Classification Model</title>
      <p>We perform the machine learning process based
on the documents of the candidate author and
those of the three previously selected authors. We use
the Weka tool to represent the known
author and the other three authors in an ARFF file
containing the extracted features. Then,
we apply a learning algorithm to this file in order
to get a prediction model in which the known texts
are the positive examples and the documents written
by the other authors are the negative examples.
The algorithm is chosen after testing
multiple classifiers, such as SVM, decision
trees, Naive Bayes, decision table and KNN. We
choose the algorithm that gives the best
performance.
      </p>
      <p>4 Basic Characteristics of Our Hybrid Method</p>
      <p>
        Hybridization has always been considered an
interesting track because it overcomes the
limitations of the combined approaches.
Table 1 presents a comparison between the
different methods of author identification.
      </p>
      <p>
        Verification
model: The intrinsic models use only the texts within
a verification problem
        <xref ref-type="bibr" rid="ref2 ref4">(Zheng et al., 2006)</xref>
        ,
        <xref ref-type="bibr" rid="ref8">(Feng et al., 2013)</xref>
        ,
        <xref ref-type="bibr" rid="ref11">(Mosteller and Wallace, 1964)</xref>
        . In other
studies (Labbé, 2014),
        <xref ref-type="bibr" rid="ref18">(Savoy, 2012)</xref>
        , Labbé and
Savoy consider other texts written by different
authors and attempt to transform the verification task
into a binary classification problem.
According to PAN@CLEF 2014 and PAN@CLEF
2015, extrinsic models give better results than
intrinsic ones
        <xref ref-type="bibr" rid="ref1">(Stamatatos et al., 2014)</xref>
        .
      </p>
      <p>
        Classification: There are two methods of classification:
eager methods, which use supervised learning
        <xref ref-type="bibr" rid="ref2 ref4">(Zheng et al., 2006)</xref>
        ,
        <xref ref-type="bibr" rid="ref8">(Feng et al., 2013)</xref>
        , and lazy methods,
which do not apply any learning algorithm
        <xref ref-type="bibr" rid="ref11">(Mosteller and Wallace, 1964)</xref>
        , (Labbé, 2003),
        <xref ref-type="bibr" rid="ref18">(Savoy, 2012)</xref>
        . In
this paper, we resort to supervised learning using
SVM.
      </p>
      <p>
        Attribution paradigm: There are two
attribution paradigms
        <xref ref-type="bibr" rid="ref19">(Stamatatos et al., 2000)</xref>
        . In the
instance-based paradigm, each document is
represented separately
        <xref ref-type="bibr" rid="ref8">(Feng et al., 2013)</xref>
        ,
        <xref ref-type="bibr" rid="ref13">(Labbé, 2003)</xref>
        ,
        <xref ref-type="bibr" rid="ref18">(Savoy, 2012)</xref>
        . In contrast, the profile-based
paradigm tries to construct an author profile
using all the texts of the corresponding author
        <xref ref-type="bibr" rid="ref2 ref4">(Zheng et al., 2006)</xref>
        ,
        <xref ref-type="bibr" rid="ref11">(Mosteller and Wallace, 1964)</xref>
        . We choose a hybrid of the
two paradigms: each document is first represented
separately, and these representations are then combined into a single author
profile.
      </p>
      <p>
        Text analysis: Most of the proposed
studies used part-of-speech (POS) tagging
        <xref ref-type="bibr" rid="ref2 ref4">(Zheng et al., 2006)</xref>
        ,
        <xref ref-type="bibr" rid="ref11">(Mosteller and Wallace, 1964)</xref>
        because
of the availability of taggers. Some other
studies resorted to intertextual distance
        <xref ref-type="bibr" rid="ref13">(Labbé, 2003)</xref>
        ,
        <xref ref-type="bibr" rid="ref18">(Savoy, 2012)</xref>
        . In contrast, our method combines
statistical and stylistic features (Sections 3.1 and 3.2).
      </p>
      <p>5 Experiments and Evaluation</p>
      <p>
In this section, we present the experimental results
of our author identification method. We first
describe the corpus and the evaluation measures.
Then, we report the performance of our system in
the identification of anonymous authors.</p>
      <sec id="sec-3-1">
        <title>5.1 Corpus</title>
        <p>The training corpus includes a set of folders from
the PAN@CLEF 2014 evaluation campaign.
Each folder contains up to five training
documents and one test document in English. The
length of the documents varies from a few hundred
to a few thousand words. It is worth noting that
the experiments were carried out on the 200
problems in the corpus.</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.2 Performance Measures</title>
        <p>
          To assess our results, we adopt the C@1
measure
          <xref ref-type="bibr" rid="ref24">(Peñas and Rodrigo, 2011)</xref>
          , AUC and recall
metrics.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Recall</title>
        <p>In the context of classification tasks, the terms
true positives, true negatives, false positives and
false negatives are used to compare the predicted
class of an item with its actual class:</p>
        <p>TN / True Negative: case was negative and
predicted negative
TP / True Positive: case was positive and
predicted positive
FN / False Negative: case was positive but
predicted negative</p>
        <p>FP / False Positive: case was negative but
predicted positive</p>
        <p>Recall = TP / (TP + FN)</p>
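        <p>As a small worked illustration of this definition, recall can be computed from binary labels as in the following Python sketch. The function name recall and the encoding of labels as 1 (positive) and 0 (negative) are assumptions made for the example.</p>
        <preformat>
```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: positive cases predicted positive.
    tp = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    # False negatives: positive cases predicted negative.
    fn = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0
```
        </preformat>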
      </sec>
      <sec id="sec-3-4">
        <title>C@1 score</title>
        <p>The evaluation score C@1 has the advantage of
taking into account the documents that the classifier is
unable to assign to a category. For each problem,
a score greater than 0.5 is considered a
positive response, while a score below 0.5 is viewed
as a negative response, meaning that the test
document does not belong to this author.
Scores equal to 0.5 correspond to
unanswered problems, where the answer is "I
don't know". C@1 is then defined as follows:
          <disp-formula>
            <tex-math>c@1 = \frac{1}{n} \left( n_c + n_u \cdot \frac{n_c}{n} \right)</tex-math>
          </disp-formula>
        </p>
        <p>
          <xref ref-type="bibr" rid="ref24">(Peñas and Rodrigo, 2011)</xref>
          where:
n = number of problems ;
nc = number of correct answers ;
nu = number of unanswered problems
        </p>
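        <p>The definition above translates directly into code. The following Python sketch is illustrative only; the function name c_at_1 and the representation of ground truth as booleans are assumptions, while the 0.5 threshold follows the description in the text.</p>
        <preformat>
```python
def c_at_1(scores, truths, threshold=0.5):
    """c@1 = (1/n) * (nc + nu * nc / n).

    scores: one value per problem; above the threshold means "same author".
    truths: one boolean per problem; True if the author really wrote the text.
    """
    n = len(scores)
    if n == 0:
        return 0.0
    nc = 0  # number of correct answers
    nu = 0  # number of unanswered problems (score equal to the threshold)
    for score, truth in zip(scores, truths):
        if score == threshold:
            nu += 1
        elif (score > threshold) == truth:
            nc += 1
    return (nc + nu * nc / n) / n
```
        </preformat>
        <p>For example, with four problems, two correct answers and one unanswered problem, the score is (2 + 1 * 2 / 4) / 4 = 0.625, which is higher than the plain accuracy of 0.5 because leaving a problem unanswered is rewarded over answering it wrongly.</p>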
      </sec>
      <sec id="sec-3-5">
        <title>AUC score</title>
        <p>The AUC is a common evaluation metric for
binary classification problems.</p>
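        <p>The AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The following Python sketch computes it that way; the function name auc and the 0/1 label encoding are assumptions, and the quadratic pairwise loop is chosen for clarity rather than speed.</p>
        <preformat>
```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive cases, 0 for negative cases.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```
        </preformat>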
        <p>Figure 1 presents an example of an AUC plot.
Consider a plot of the true positive rate versus the false
positive rate as the threshold value for classifying
an item as 0 or 1 is increased from 0 to 1. If the
classifier is very good, the true positive rate will
increase quickly and the area under the curve will
be close to 1. If the classifier is no better than
random guessing, the true positive rate will increase
linearly with the false positive rate and the area
under the curve will be around 0.5.</p>
        <p>The histograms below summarize the experiments
conducted to obtain the best possible attribution of document
paternity:</p>
        <p>Figure 2(a) shows the accuracy reached on a
test set by six well-known classifiers in order to
select the best one. This accuracy is determined with
all the stylistic features and the n-gram features
(with n varying between 3 and 7). The best
accuracy is achieved by the SVM
algorithm, with a slight advantage over the Naive
Bayes classifier. Figure 2(b) shows that the
character features are not very powerful in
determining the authors of documents whose origin is
unknown. On the other hand, the syntactic features
give encouraging results. Combining these
features provides better performance than using
each feature separately. Figure 2(c) depicts the
c@1 histogram of the n-gram method. It
shows that accuracy reaches a maximum for n = 3
and 4, then decreases as n increases.
After that, we vary the number m of most frequent
words (between 100 and 400). Figure 2(d) shows
that the best c@1 measure is obtained with the
SVM algorithm and 250 words. This measure
decreases as the number of words increases.</p>
        <p>Figure 3 demonstrates that combining the
syntactic features, the lexical ones and the 3-grams
brings encouraging results in a machine learning
process. However, using the Delta method
to classify documents gives better results than the
stylistic method, with which we obtain a c@1
score of 0.54. In the hybrid evaluation step, this result
is somewhat improved by using the Delta method
during the authors selection step. These
measures reach their highest values with the choice of the 250 most
frequent words. Our system has proven its
effectiveness when the statistical and the stylistic
analyses are combined. Thus, we were able to
find the unknown author of a document in 59%
of the studied cases. In Table 2, we compare the
performance of our method with that of the
winner of the PAN@CLEF 2014 evaluation campaign
for the English essays. From Table 2, we notice
that our method is useful in terms of recall, where it
outperforms
          <xref ref-type="bibr" rid="ref25">Frery et al. (2014)</xref>
          , although
C@1 and AUC still need to be further improved.
Compared with the results of the PAN@CLEF 2014 evaluation
campaign
          <xref ref-type="bibr" rid="ref1">(Stamatatos et al., 2014)</xref>
          , our classification
results are encouraging, which shows the
effectiveness of our method. In future work, we will try to improve
our results by focusing on the attribute selection step.</p>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Performance of our method compared with the baseline and the winner of PAN@CLEF 2014 for the English essays.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Method</th><th>C@1</th><th>Recall</th><th>AUC</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>0.53</td><td>0.5</td><td>0.54</td></tr>
              <tr><td>Our method</td><td>0.68</td><td>0.74</td><td>0.6</td></tr>
              <tr><td>Frery et al. (2014)</td><td>0.71</td><td>0.72</td><td>0.72</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-5-2">
          <title>6 Conclusion</title>
          <p>
In this paper, we have focused on the author
identification problem by applying a machine learning
process. The introduced hybrid method is
essentially based on using both stylistic and
statistical characteristics. The experimental results show
the efficiency of the proposed technique, in which
we use the Delta method prior to syntactic and
lexical features as well as n-grams and character
features. We have also shown through the
experiments carried out how heterogeneous models allowed
us to determine document paternity appropriately.
In future research, we will try to make our
technique more effective by using a text
extraction tool. The main objective will be to show that
the author's style is clear in some specific parts of
the written text.</p>
          <p>We are also planning to apply our approach to
German, Spanish and Greek corpora to show the
efficiency of our method in a multilingual context.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Stamatatos</given-names>
            <surname>Efstathios</surname>
          </string-name>
          , Daelemans Walter, Verhoeven Ben , Potthast Martin,
          <string-name>
            <given-names>Stein Benno</given-names>
            , Juola Patrick, Miguel A.
            <surname>Sanchez-Perez</surname>
          </string-name>
          , and
          <string-name>
            <surname>Barrón-Cedeño Alberto</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Overview of the Author Identification Task at CLEF</article-title>
          . England.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Li</given-names>
            <surname>Jiexun</surname>
          </string-name>
          , Zheng Rong and
          <string-name>
            <given-names>Chen</given-names>
            <surname>Hsinchun</surname>
          </string-name>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>From fingerprint to writeprint</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>49</volume>
          (
          <issue>4</issue>
          ),
          <fpage>76</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Rong</surname>
          </string-name>
          , Li Jiexun,
          <source>Chen Hsinchun and Huang Zan</source>
          .
          <year>2006</year>
          .
          <article-title>A framework for authorship identification of online messages: Writing style features and classification techniques</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>57</volume>
          (
          <issue>3</issue>
          ),
          <fpage>378</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Vartapetiance</given-names>
            <surname>Anna</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gillam</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A Trinity of Trials: Surrey's 2014 Attempts at Author Verification</article-title>
          .
          <source>Proceedings of PAN@CLEF2014.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          2007.
          <article-title>Stylistic text classification using functional lexical features</article-title>
          <source>Journal of the American Society for Information Science and Technology 58(6)</source>
          ,
          <fpage>802</fpage>
          -
          <lpage>822</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Raghavan</given-names>
            <surname>Sindhu</surname>
          </string-name>
          , Kovashka Adriana and
          <string-name>
            <given-names>Mooney</given-names>
            <surname>Raymond</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Authorship attribution using probabilistic context free grammars</article-title>
          .
          <source>Proceedings of ACL10</source>
          ,
          <fpage>38</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Feng</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Wei</surname>
            and
            <given-names>Hirst</given-names>
          </string-name>
          <string-name>
            <surname>Graeme</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Authorship verification with entity coherence and other rich linguistic features</article-title>
          .
          <source>Proceedings of CLEF13.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>David</surname>
          </string-name>
          and
          <string-name>
            <surname>Mcnamara</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Danielle</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Analyzing writing styles with coh-metrix</article-title>
          .
          <source>Proceedings of FLAIRS06</source>
          ,
          <fpage>764</fpage>
          -
          <lpage>769</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Baayen R.</given-names>
            <surname>Harald</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Analyzing Linguistic Data</article-title>
          . A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Mosteller</given-names>
            <surname>Frederick</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wallace</given-names>
            <surname>David</surname>
          </string-name>
          .
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>Inference in an Authorship Problem</source>
          ,
          <year>1964</year>
          .
          <source>In Journal of the American Statistical Association</source>
          , Volume
          <volume>58</volume>
          ,
          <source>Issue</source>
          <volume>302</volume>
          ,
          <fpage>275</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Labbé</given-names>
            <surname>Cyril</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Intertextual Distance and Authorship Attribution. Corneille and Molière</article-title>
          , In: Journal of Quantitative Linguistics, , pp.
          <fpage>213</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Burrows</given-names>
            <surname>John</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Delta: a Measure of Stylistic Difference and a Guide to Likely Authorship</article-title>
          ,
          <source>In Journal Lit Linguist Computing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Blei</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>David</surname>
            , and
            <given-names>Jordan I. Michael.</given-names>
          </string-name>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
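          <string-name>
            <surname>Hershey</surname>
            <given-names>John R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olsen</surname>
            <given-names>Peder A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rennie</surname>
            <given-names>Steven J.</given-names>
          </string-name>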
          .
          <year>2007</year>
          .
          <article-title>Variational Kullback-Leibler divergence for hidden Markov models</article-title>
          .
          <source>IEEE Workshop on Automatic Speech Recognition and Understanding</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Grieve</surname>
            <given-names>Jack</given-names>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Literary and linguistic computing</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ),
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            <given-names>Jacques</given-names>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Étude comparative de stratégies de sélection de prédicteurs pour l'attribution d'auteur</article-title>
          .
          <source>Conférence en Recherche d'Information et Applications (CORIA)</source>
          .
          <fpage>215</fpage>
          -
          <lpage>228</lpage>
          , France.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
            <given-names>Efstathios</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fakotakis</surname>
            <given-names>Nikos</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kokkinakis</surname>
            <given-names>George</given-names>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Automatic text categorization in terms of genre and author</article-title>
          ,
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          ,
          <fpage>471</fpage>
          -
          <lpage>495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
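          <string-name>
            <surname>Lee</surname>
            <given-names>C. Min</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mani</surname>
            <given-names>Inderjeet</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhagen</surname>
            <given-names>Marc</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wellner</surname>
            <given-names>Ben</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pustejovsky</surname>
            <given-names>James</given-names>
          </string-name>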
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>Machine learning of temporal relations</article-title>
          .
          <source>Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <fpage>753</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Zhao</surname>
            <given-names>Ying</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zobel</surname>
            <given-names>Justin</given-names>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Searching with style: Authorship attribution in classic literature</article-title>
          ,
          <source>Proceedings of the Thirtieth Australian Computer Science Conference</source>
          , ACM Press,
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          , Australia.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <year>2014</year>
          .
          <article-title>Author Verification: Exploring a Large Set of Parameters Using a Genetic Algorithm</article-title>
          .
          <source>Notebook for PAN at CLEF 2014</source>
          . England.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Peñas</surname>
            <given-names>Anselmo</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rodrigo</surname>
            <given-names>Álvaro</given-names>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A Simple Measure to Assess Nonresponse</article-title>
          .
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics</source>
          , Vol.
          <volume>1</volume>
          ,
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Fréry</surname>
            <given-names>Jordan</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Largeron</surname>
            <given-names>Christine</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Juganaru-Mathieu</surname>
            <given-names>Mihaela</given-names>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>UJM at CLEF in Author Identification</article-title>
          .
          <source>PAN@CLEF2014</source>
          . England.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>