An Empirical Method Exploring a Large Set of Features for Authorship Identification

Seifeddine Mechti (LARODEC Laboratory, ISG of Tunis, B.P. 1088, 2000 Le Bardo, Tunisia) mechtiseif@gmail.com
Rim Faiz (LARODEC Laboratory, IHEC Carthage, Tunisia) Rim.faiz@ihec.rnu.tn
Maher Jaoua (MIRACL Laboratory, FSEGS, BP 1088, 3018 Sfax, Tunisia) maher.jaoua@fsegs.rnu.tn
Lamia Hadrich Belguith (MIRACL Laboratory, FSEGS, BP 1088, 3018 Sfax, Tunisia) l.belguith@fsegs.rnu.tn

Abstract

In this paper, we deal with the problem of identifying the author of a document whose origin is unknown. To address this problem, we propose a new hybrid approach combining statistical and stylistic analysis. Our method determines lexical and syntactic features of the written text in order to identify the author of the document, and these features are then used to build a machine learning process. We obtained promising results on the PAN@CLEF 2014 English literature corpus, comparable to those of the best state-of-the-art methods.

1 Introduction

Recently, much interest has been given to document authorship because of its applications in many domains, such as e-commerce and forensic linguistics. In the latter, for instance, author identification can make many investigations easier. Additionally, the author identification task is very useful in the plagiarism detection process: the probability of plagiarism increases when two parts of a document cannot be assigned to the same author. This task is planned for PAN@CLEF 2016. In addition, forensic analysis of document paternity for legal purposes can contribute to several investigations focusing on various linguistic characteristics.

In the literature, the automation of the author identification task can draw on stylistic or statistical attributes. Currently, machine learning techniques are used to infer attributes that discriminate authors' styles. In this context, we propose a hybrid method combining stylistic and statistical attributes while relying on measurements of inter-textual distance, and we present the results of our experiments using several learning techniques. The objective of the task described in (Stamatatos et al., 2014) is to determine, from a specific list of candidates, the author who wrote a given text; such identification can be framed as an open-set or a closed-set classification problem. In this context, we address a non-factoid question: was a particular text written by a well-defined author?

This paper is organized as follows. In Section 2, we review the author identification approaches proposed in the literature. In Section 3, we present our hybrid method based on statistical and stylistic analysis and describe the machine learning process. Section 4 compares the main characteristics of our method with those of existing approaches. Section 5 reports the experiments carried out, together with the applied tests and algorithms, and compares our results with those obtained by other methods. Finally, we end the paper with some concluding remarks and directions for future research.

2 Related Work

In this section, we review author identification methods, classified into three main categories. The first is based on stylistic analysis. The second contains techniques relying on various statistical analyses. The third, which includes more recent methods, uses machine learning algorithms.
The basic idea of the stylistic methods is to model authors from a linguistic point of view. For instance, Li et al. (2006) focused on typographic signs, and Zheng et al. (2006) studied the co-occurrence of character n-grams. Other researchers considered the distribution of function words (Vartapetiance and Gillam, 2014) or lexical features (Argamon et al., 2007). Raghavan et al. (2010) used probabilistic context-free grammars to model the grammar employed by an author. Feng and Hirst (2013) dealt with the syntactic functions of words and their relationships in order to capture entity coherence. Other studies examined the semantic dependencies between the words of written texts by means of taxonomies and thesauri (McCarthy et al., 2006).

Concerning statistical methods, the first attempts go back to Mosteller and Wallace (1964), who compared the occurrence frequencies of words such as verbs, nouns, articles, prepositions, conjunctions and pronouns. In recent years, new methods based on various statistical tools have been introduced in order to discriminate between the potential authors of a text, among them inter-textual distance (Labbé, 2003), the Delta method (Burrows, 2002), the LDA topic distribution (Blei and Jordan, 2004) and the Kullback-Leibler divergence (Hershey et al., 2007). Labbé (2003) demonstrated the effectiveness of inter-textual distance in quantifying the proximity between texts through a normalized index, and later revealed the considerable contribution of Corneille to plays attributed to Molière. Burrows (2002) proposed the Delta method to identify the author of an unknown document, suggesting the selection of the 40 to 150 most frequent words, especially function words, while ignoring punctuation. Grieve (2007) showed that the Delta method can offer some of the best results. To determine document paternity, Savoy (2012) introduced a probabilistic model for author identification based on topics: each corpus is modeled as a distribution over themes, and each theme is a specific distribution over words.

From a machine learning point of view (Stamatatos et al., 2014), author verification methods can be either intrinsic or extrinsic. Intrinsic methods use only the known and unknown texts of the problem, while extrinsic methods also exploit external documents written by other authors. The training corpora are represented in different forms: each text is considered as a vector in a space with several variables. A variety of powerful algorithms, including discriminant analysis (Stamatatos et al., 2000), SVM (Lee et al., 2006), decision trees (Zhao and Zobel, 2007), neural networks (Argamon et al., 2007) and genetic algorithms (Moreau et al., 2014), can then be used to construct a classification model. In a critical study, Baayen (2008) showed that stylistic methods perform poorly on short texts and that style can change over time or with the literary genre of the text (poetry, novels, plays, etc.). Conversely, despite their interesting results, purely statistical analyses ignore the writer's style: neither the vocabulary nor the theme of the suspect document is taken into account. Olson criticized studies that reduce language to mathematical equations (Hershey et al., 2007).

We choose hybridization in order to take advantage of both the stylistic and the statistical methods. On the one hand, we use lexical and syntactic analysis to address the mathematical representation of a text (Section 3.1); on the other hand, we apply the Delta rule to group the writers who have almost the same style (Section 3.2).

3 The Proposed Method

This section describes our hybrid extrinsic method for author identification. First, we extract the different types of stylistic features (syntactic, lexical and character-based) as well as character n-grams. In the second step, author selection, we apply the Delta method. The third step is the application of the learning model.
3.1 Feature Extraction

In order to extract features, also called style markers, we use the Apache Open library tools. These robust tools allow us to segment the texts and to carry out the necessary syntactic and semantic analysis. For the lexical features, obtained by frequency calculations, the text is regarded as a set of tokens. We consider the number of words that appear only once, the ratio V/N (where V is the number of hapaxes and N is the length of the text), the average sentence length, and the number of words that appear exactly twice. Then, we extract syntactic features such as the number of nouns, verbs, adjectives, adverbs and prepositions. For the character features, we consider the text as a simple sequence of characters and take into account the frequencies of letters, punctuation marks (number of colons, exclamation marks, question marks and commas), uppercase and lowercase characters, and numerical and alphabetical characters. Finally, we use character n-grams, with n varying from 3 to 7: a small n (n = 3) captures syllables and punctuation, while a larger n approximates whole words.
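To make the feature set concrete, the following Python sketch computes a few of the style markers listed above (hapax ratio, words appearing twice, average sentence length, punctuation and character counts, character n-grams). It is a minimal illustration only: it relies on simple regular-expression tokenization rather than the Apache toolkit used in the paper, omits the POS-based counts, and all function names are ours.

```python
# A minimal sketch of the style-marker extraction described in Section 3.1.
# It uses simple regex tokenization instead of the toolkit mentioned in the
# paper, and omits POS-based counts; names are illustrative only.
import re
from collections import Counter

def lexical_features(text):
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(tokens)
    n_tokens = len(tokens) or 1
    hapaxes = sum(1 for c in counts.values() if c == 1)
    dis_legomena = sum(1 for c in counts.values() if c == 2)
    return {
        "hapax_ratio": hapaxes / n_tokens,          # V/N in the paper
        "dis_legomena": dis_legomena,               # words appearing twice
        "avg_sentence_len": n_tokens / max(len(sentences), 1),
    }

def character_features(text):
    return {
        "colons": text.count(":"),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "commas": text.count(","),
        "uppercase": sum(ch.isupper() for ch in text),
        "lowercase": sum(ch.islower() for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
    }

def char_ngrams(text, n_min=3, n_max=7):
    # Character n-gram frequencies for n = 3..7, as in Section 3.1.
    grams = Counter()
    for n in range(n_min, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago, never mind how long, I went to sea!"
    print(lexical_features(sample))
    print(character_features(sample))
```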
3.2 Author Selection

In this step, we select authors in order to prepare the machine learning process. We apply the Delta method between the questioned document and the known documents of all authors, and for each unknown document we keep the three authors with the lowest Delta distance to it. Note that different verification problems (different folders) may share documents by the same author; for example, the known documents of folders EN001 and EN002 may be written by the same person. The distance is based on the standardized frequencies (Z-scores) of the m most frequent terms and is computed between a questioned document Q and a known document A_j as:

D(Q, A_j) = (1/m) * sum_{i=1..m} |Zscore(t_iQ) - Zscore(t_ij)|

where

Zscore(t_ij) = (tfr_ij - mean(i)) / sd(i)

Here tfr_ij is the relative frequency of term t_i in document A_j, mean(i) is the average frequency of t_i over the corpus, and sd(i) is its standard deviation. In practice, we use between 100 and 400 of the most common terms.
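The author selection step can be illustrated with the following sketch of a Burrows-style Delta ranking over z-scored relative term frequencies. It assumes the term counts of the questioned and known documents have already been computed (for instance with the tokenizer sketched after Section 3.1); the function names and the handling of zero variance are our own choices, not part of the original method description.

```python
# A sketch of the author-selection step (Section 3.2): Delta distance over
# z-scored relative frequencies of the m most frequent terms.
# `known_docs` maps an author id to that author's term-frequency Counter.
from collections import Counter
import math

def relative_freqs(counter):
    total = sum(counter.values()) or 1
    return {t: c / total for t, c in counter.items()}

def delta_ranking(questioned, known_docs, m=250):
    """Return author ids sorted by Delta distance to the questioned document."""
    freqs = {a: relative_freqs(c) for a, c in known_docs.items()}
    q_freq = relative_freqs(questioned)
    # m most frequent terms over all known documents
    pooled = Counter()
    for c in known_docs.values():
        pooled.update(c)
    terms = [t for t, _ in pooled.most_common(m)]
    # mean and standard deviation of each term's relative frequency
    stats = {}
    for t in terms:
        vals = [freqs[a].get(t, 0.0) for a in freqs]
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1e-9
        stats[t] = (mean, sd)
    scores = {}
    for a in freqs:
        d = 0.0
        for t in terms:
            mean, sd = stats[t]
            z_q = (q_freq.get(t, 0.0) - mean) / sd
            z_a = (freqs[a].get(t, 0.0) - mean) / sd
            d += abs(z_q - z_a)
        scores[a] = d / len(terms)
    return sorted(scores, key=scores.get)

# The three stylistically closest authors are then kept for the learning step:
# candidates = delta_ranking(questioned_counts, known_counts)[:3]
```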
3.3 Application of a Classification Model

We perform the machine learning process using the documents of the candidate author and those of the three authors selected in the previous step. We use the Weka tool to represent the known author and the three other authors in an ARFF file containing the extracted features. We then apply a learning algorithm to this file in order to obtain a prediction model in which the known texts are the positive examples and the documents written by the other authors are the negative examples. The algorithm is chosen after testing several classifiers (SVM, decision trees, Naive Bayes, decision table and KNN); we keep the one that gives the best performance.
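The classifier comparison of this step can be sketched as follows. The paper builds an ARFF file and runs the classifiers in Weka; as a stand-in, this illustrative snippet uses scikit-learn and cross-validation to compare analogous classifier families (SVM, decision tree, Naive Bayes, KNN; Weka's decision table has no direct scikit-learn counterpart) and keeps the best one. The construction of the feature matrix is assumed to happen elsewhere, and all names here are ours.

```python
# Illustrative sketch of the classifier-selection step (Section 3.3).
# scikit-learn stands in for the Weka setup described in the paper.
# X holds feature vectors built from the style markers of Section 3.1;
# y marks known-author texts (1) vs. texts of the Delta-selected authors (0).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def pick_best_classifier(X, y):
    candidates = {
        "SVM": SVC(kernel="linear"),
        "DecisionTree": DecisionTreeClassifier(),
        "NaiveBayes": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=3),
    }
    # Cross-validated accuracy for each candidate; keep the best performer.
    results = {name: cross_val_score(clf, X, y, cv=3).mean()
               for name, clf in candidates.items()}
    best = max(results, key=results.get)
    return best, candidates[best].fit(X, y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 12))          # toy feature vectors
    y = np.array([1] * 20 + [0] * 20)      # positive vs. negative examples
    name, model = pick_best_classifier(X, y)
    print("best classifier:", name)
```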
4 Basic Characteristics of our Hybrid Method

Hybridization has always been considered an interesting track because it overcomes the limitations of the individual approaches it combines. Table 1 compares the main characteristics of existing author identification methods with ours.

Verification model: Intrinsic models use only the texts of the verification problem itself (Zheng et al., 2006; Feng and Hirst, 2013; Mosteller and Wallace, 1964). Other studies (Labbé, 2003; Savoy, 2012) consider additional texts written by different authors and transform the verification task into a binary classification problem. According to PAN@CLEF 2014 and PAN@CLEF 2015, extrinsic models give better results than intrinsic ones (Stamatatos et al., 2014).

Classification: Two strategies exist: eager methods, which build a model through supervised learning (Zheng et al., 2006; Feng and Hirst, 2013), and lazy methods, which build no model in advance (Mosteller and Wallace, 1964; Labbé, 2003; Savoy, 2012). In this paper, we use supervised learning with SVM.

Attribution paradigm: Two attribution paradigms exist (Stamatatos et al., 2000). In the instance-based paradigm, each document is represented separately (Feng and Hirst, 2013; Labbé, 2003; Savoy, 2012); in the profile-based paradigm, a single author profile is built from all the texts of that author (Zheng et al., 2006; Mosteller and Wallace, 1964). We choose a hybrid of the two paradigms: each document receives its own representation, and these representations are then combined into a single author profile.

Text analysis: Most studies rely on part-of-speech (POS) tagging (Zheng et al., 2006; Mosteller and Wallace, 1964) because of the availability of taggers; others use intertextual distance (Labbé, 2003; Savoy, 2012). Our method combines statistical and stylistic features (Sections 3.1 and 3.2).

Table 1: Author identification methods

Author(s)               | Verification model | Classification | Attribution paradigm | Text analysis
(Zheng et al., 2006)    | extrinsic          | Eager          | Author profile       | POS tagging
(Feng and Hirst, 2013)  | extrinsic          | Eager          | Instance based       | POS tagging
(Wallace et al., 2011)  | extrinsic          | Lazy           | Author profile       | POS tagging
(Labbé, 2003)           | intrinsic          | Lazy           | Instance based       | Intertextual distance
(Savoy, 2012)           | intrinsic          | Lazy           | Instance based       | Delta method
Our method              | extrinsic          | Eager          | Hybrid               | Delta method + POS tagging

5 Experiments and Evaluation

In this section, we present the experimental results of our method for author identification. We first describe the corpus and the evaluation measures, then report the performance of our system in identifying anonymous authors.

5.1 Corpus

The training corpus consists of a set of folders from the PAN@CLEF 2014 evaluation campaign. Each folder contains up to five known documents and one test document in English. The length of the documents varies from a few hundred to a few thousand words. The experiments were carried out on the 200 problems of the corpus.

5.2 Performance Measures

To assess our results, we adopt the C@1 measure (Peñas and Rodrigo, 2011), the AUC and recall.

Recall. In classification tasks, the terms true positive, true negative, false positive and false negative are used to compare the predicted class of an item with its actual class:

TN / True Negative: the case was negative and predicted negative
TP / True Positive: the case was positive and predicted positive
FN / False Negative: the case was positive but predicted negative
FP / False Positive: the case was negative but predicted positive

Recall = TP / (TP + FN)

C@1 score. The C@1 score has the advantage of taking into account the documents that the classifier is unable to assign to a category. For each problem, a score greater than 0.5 is counted as a positive answer and a score below 0.5 as a negative answer (the test document does not belong to this author). A score equal to 0.5 corresponds to an unanswered problem, i.e. the answer "I don't know". C@1 is then defined as follows (Peñas and Rodrigo, 2011):

c@1 = (1/n) * (nc + (nu * nc / n))

where n is the number of problems, nc the number of correct answers, and nu the number of unanswered problems.

AUC score. The AUC is a common evaluation metric for binary classification problems; Figure 1 presents an example of an AUC plot. Consider a plot of the true positive rate against the false positive rate as the threshold for classifying an item as 0 or 1 is varied from 0 to 1: if the classifier is very good, the true positive rate increases quickly and the area under the curve is close to 1; if the classifier is no better than random guessing, the true positive rate increases linearly with the false positive rate and the area under the curve is around 0.5.

Figure 1: Example of AUC plot
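For reference, the three measures can be computed as in the following sketch. The c_at_1 function follows the definition above, recall is computed from the confusion counts, and the AUC relies on scikit-learn's roc_auc_score, which is an assumed dependency rather than the tool used in the paper.

```python
# Sketch of the evaluation measures of Section 5.2. c@1 follows
# Penas and Rodrigo (2011) as given above; roc_auc_score from scikit-learn
# stands in for the AUC computation (an assumed dependency).
from sklearn.metrics import roc_auc_score

def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def c_at_1(y_true, scores):
    """Scores > 0.5 mean 'same author', < 0.5 'different author', == 0.5 unanswered."""
    n = len(scores)
    nc = sum(1 for t, s in zip(y_true, scores)
             if (s > 0.5 and t == 1) or (s < 0.5 and t == 0))
    nu = sum(1 for s in scores if s == 0.5)
    return (1.0 / n) * (nc + nu * nc / n)

if __name__ == "__main__":
    truth = [1, 0, 1, 1, 0, 0]
    scores = [0.8, 0.3, 0.5, 0.7, 0.5, 0.1]   # two unanswered problems
    answers = [1 if s > 0.5 else 0 for s in scores]
    print("recall:", recall(truth, answers))
    print("c@1:", c_at_1(truth, scores))
    print("AUC:", roc_auc_score(truth, scores))
```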
5.3 Result Analysis

The histograms in Figure 2 summarize the experiments conducted to determine document paternity as accurately as possible. Figure 2 (a) shows the accuracy reached by six well-known classifiers, used to select the best one; this accuracy is computed with all the stylistic features and the n-gram features (n varying between 3 and 7). The best accuracy is achieved by the SVM algorithm, with a slight advantage over the Naive Bayes classifier. Figure 2 (b) shows that the character features alone are not very powerful in determining the authors of documents of unknown origin, whereas the syntactic features give encouraging results; combining the features provides better performance than using each of them separately. Figure 2 (c) depicts the c@1 histogram of the n-gram features: accuracy reaches its maximum for n = 3 and n = 4 and then decreases as n increases. Finally, we vary the number m of most frequent words between 100 and 400. Figure 2 (d) shows that the best c@1 measure is obtained with the SVM algorithm and 250 words, and that the measure decreases as the number of words increases.

Figure 2: Author identification histograms

Figure 3 shows that combining the syntactic features, the lexical features and the 3-grams brings encouraging results in a machine learning process. However, using the Delta method to classify documents gives better results than the purely stylistic method, with which we obtain a c@1 score of 0.54. In the hybrid evaluation, this result is further improved by applying the Delta method during the author selection step, and the measures reach their highest value with the 250 most frequent words. Our system proved its effectiveness when the statistical and the stylistic analyses were combined: we were able to find the unknown author of a document in 59% of the studied cases. In Table 2, we compare the performance of our method with that of the winner of the PAN@CLEF 2014 campaign for the English essays.

Figure 3: C@1 performance of the different features according to the number of words

Table 2: Comparison between our performance and Frery et al. (2014)

        | Baseline | Our method | Frery et al. (2014)
C@1     | 0.53     | 0.68       | 0.71
Recall  | 0.5      | 0.74       | 0.72
AUC     | 0.54     | 0.6        | 0.72

Table 2 shows that our method performs well in terms of recall: it noticeably outperforms Frery et al. (2014), although C@1 and AUC still need to be improved. With respect to the PAN@CLEF 2014 campaign (Stamatatos et al., 2014), our classification results are encouraging, which shows the effectiveness of our method. In future work, we will focus on the attribute selection step to further improve our results.

6 Conclusion

In this paper, we have addressed the author identification problem by applying a machine learning process. The introduced hybrid method is essentially based on using both stylistic and statistical characteristics. The experimental results show the efficiency of the proposed technique, in which the Delta method is applied prior to exploiting syntactic, lexical, n-gram and character features. We have also shown through the experiments how heterogeneous models allow us to detect document paternity appropriately. In future work, we will try to make our technique more effective by using a text extraction tool; the main objective will be to show that an author's style is more apparent in some specific parts of the written text. We also plan to apply our approach to German, Spanish and Greek corpora in order to demonstrate its efficiency in a multilingual context.

References

Argamon Shlomo, Whitelaw Casey, Chase Paul, Hota S. Raj, Garg Navendu and Levitan Shlomo. 2007. Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6), 802-822.

Baayen R. Harald. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press, Cambridge.

Blei M. David and Jordan I. Michael. 2004. Variational methods for the Dirichlet process. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM.

Burrows John. 2002. Delta: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing.

Feng V. Wei and Hirst Graeme. 2013. Authorship verification with entity coherence and other rich linguistic features. In Proceedings of CLEF 2013.

Frery Jordan, Largeron Christine, and Juganaru-Mathieu Mihaela. 2014. UJM at CLEF in Author Identification. Notebook for PAN at CLEF 2014.

Grieve Jack. 2007. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3), 251-270.

Hershey R. John, Olsen A. Peder and Rennie J. Steven. 2007. Variational Kullback-Leibler divergence for Hidden Markov models. IEEE Workshop on Automatic Speech Recognition and Understanding.

Labbé Cyril. 2003. Intertextual Distance and Authorship Attribution: Corneille and Molière. Journal of Quantitative Linguistics, 213-231.

Lee C. Min, Mani Inderjeet, Verhagen Marc, Wellner Ben, and Pustejovsky James. 2006. Machine learning of temporal relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 753-760.

Li Jiexun, Zheng Rong and Chen Hsinchun. 2006. From fingerprint to writeprint. Communications of the ACM, 49(4), 76-82.

McCarthy M. Philip, Lewis A. Gwyneth, Dufty F. David and McNamara S. Danielle. 2006. Analyzing writing styles with Coh-Metrix. In Proceedings of FLAIRS 2006, 764-769.

Moreau Erwan, Jayapal Arun, and Vogel Carl. 2014. Author Verification: Exploring a Large Set of Parameters using a Genetic Algorithm. Notebook for PAN at CLEF 2014.

Mosteller Frederick and Wallace David. 1964. Inference in an Authorship Problem. Journal of the American Statistical Association, 58(302), 275-309.

Peñas Anselmo and Rodrigo Álvaro. 2011. A Simple Measure to Assess Non-response. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Vol. 1, 1415-1424.

Raghavan Sindhu, Kovashka Adriana and Mooney Raymond. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of ACL 2010, 38-42.

Savoy Jacques. 2012. Etude comparative de stratégies de sélection de prédicteurs pour l'attribution d'auteur. In COnférence en Recherche d'Information et Applications (CORIA), 215-228, France.

Stamatatos Efstathios, Daelemans Walter, Verhoeven Ben, Potthast Martin, Stein Benno, Juola Patrick, Sanchez-Perez Miguel A., and Barrón-Cedeño Alberto. 2014. Overview of the Author Identification Task at CLEF 2014.

Stamatatos Efstathios, Fakotakis Nikos and Kokkinakis George. 2000. Automatic text categorization in terms of genre and author. Computational Linguistics, 26, 471-495.

Vartapetiance Anna and Gillam Lee. 2014. A Trinity of Trials: Surrey's 2014 Attempts at Author Verification. In Proceedings of PAN@CLEF 2014.

Zhao Ying and Zobel Justin. 2007. Searching with style: Authorship attribution in classic literature. In Proceedings of the Thirtieth Australasian Computer Science Conference, ACM Press, 59-68, Australia.

Zheng Rong, Li Jiexun, Chen Hsinchun and Huang Zan. 2006. A framework for authorship identification of online messages: Writing style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378-393.