An Empirical Method Exploring a Large Set of Features for Authorship Identification

Seifeddine Mechti, LARODEC Laboratory, ISG of Tunis, B.P. 1088, 2000 Le Bardo, Tunisia
Rim Faiz, LARODEC Laboratory, IHEC Carthage, Tunisia
Maher Jaoua, MIRACL Laboratory, FSEGS, BP 1088, 3018 Sfax, Tunisia
Lamia Hadrich Belguith, MIRACL Laboratory, FSEGS, BP 1088, 3018 Sfax, Tunisia

Abstract

In this paper, we deal with the problem of identifying the author of a document whose origin is unknown. To address this problem, we propose a new hybrid approach combining statistical and stylistic analysis. The method determines the lexical and syntactic features of the written text in order to identify the author of the document. These features are used to build a machine learning process. We obtained promising results on the PAN@CLEF 2014 English literature corpus. The experimental results are comparable to those obtained by the best state-of-the-art methods.

1 Introduction

Recently, growing interest has been given to document authorship because of its applications in many domains, such as e-commerce and forensic linguistics. In the latter, for instance, author identification can make many investigations easier. Additionally, author identification is very useful in plagiarism detection: the probability of plagiarism increases when two parts of a document are not assigned to the same author. This task is planned in PAN@CLEF 2016. In addition, forensic analysis, or the analysis of document paternity for legal purposes, can contribute to several investigations focusing on various linguistic characteristics. In the literature, the automation of the author identification task can draw on stylistic or statistical attributes. Currently, machine learning techniques have been used to infer attributes discriminating the authors' styles. In this context, we propose a hybrid method combining stylistic and statistical attributes while relying on measurements of intertextual distances. In this paper, we present the results of our experiments using several learning techniques.

The objective of the work proposed in (Stamatatos et al., 2014) is to determine, from a specific list, the author who wrote a given text. Thus, for this identification, we should focus on open-set or closed-set classification problems. In this context, we address a non-factoid question: was a particular text written by a well-defined author?

This paper is organized as follows. In Section 2, we review the author identification approaches proposed in the literature. Section 3 presents our hybrid method, based on statistical and stylistic analysis, together with the machine learning process. Section 4 summarizes the basic characteristics of our method. Section 5 shows the experiments carried out together with the several applied tests and algorithms, and compares our results with those obtained by other methods. Finally, Section 6 ends the paper with some concluding remarks and proposes future research.

2 Related Work

In this section, we introduce author identification methods, classified essentially into three categories. The first one is based on stylistic analysis. The second contains techniques relying on various statistical analyses. The third category, which includes more recent methods, uses machine learning algorithms.

The basic idea of the stylistic methods is to model authors from a linguistic point of view. For instance, we can mention the work of Li et al., who focused on topographic signs (Li et al., 2006), as well as the studies of Zheng et al., who were interested in the co-occurrence of character n-grams (Zheng et al., 2006). Other researchers were concerned with the distribution of function words (Vartapetience et al., 2014) or with lexical features (Argamon et al., 2007). In another work, Raghavan et al. used probabilistic context-free grammars to model the grammar used by an author (Raghavan et al., 2010). Feng et al. dealt with the syntactic functions of words and their relationships in order to discern entity coherence (Feng et al., 2013). Other studies examined the semantic dependency between the words of written texts by means of taxonomies and thesauri (Maccarthy et al., 2006).

Concerning statistical methods, the first attempts emerged in (Mosteller and Wallace, 1964). They compared the occurrence frequencies of words such as verbs, nouns, articles, prepositions, conjunctions and pronouns. In the last few years, new methods based on various statistical tools have been introduced in order to discriminate between the potential authors of a text. Among these methods, we can mention intertextual distance (Labbé, 2014), the Delta method (Savoy, 2013), the LDA distribution (Blei et al., 2004) and the KL divergence distance (Herchey et al., 2007). Labbé (2003) demonstrated the effectiveness of intertextual distance in quantifying the proximity between several texts through a normalized index. Later, he revealed the considerable contribution of Corneille to plays attributed to Molière. Burrows proposed the Delta method in order to identify the author of an unknown document (Savoy, 2014). He suggested selecting the 40 to 150 most frequently used words, especially function words, while ignoring punctuation signs. In (Grieve, 2007), it was shown that the Delta method could offer the best results. To determine document paternity, a probabilistic model for author identification based on several topics was also introduced (Savoy, 2012). At this level, each corpus is modeled as a distribution of different themes, and each theme represents a specific distribution of words.

From a machine learning point of view (Stamatatos et al., 2014), author verification methods can be either intrinsic or extrinsic. Intrinsic methods use only the known and unknown texts of the problem, while extrinsic methods also use external documents of other authors for each problem. The training corpora are represented in different forms. Each text is considered as a vector in a space with several variables. In addition, a variety of powerful algorithms, including discriminant analysis (Stamatatos et al., 2000), SVM (Lee et al., 2006), decision trees (Zhao and Zobel, 2006), neural networks (Argamon et al., 2007) and genetic algorithms (Moreau et al., 2014), can be used to construct a classification model.

Finally, in a critical study, Baayen showed that stylistic methods perform poorly on short texts (Baayen et al., 2008). He also showed that style can change over time or according to the literary genre of the texts (poetry, novels, plays, ...). Besides, despite its interesting results, statistical analysis ignores the writer's style: neither the vocabulary nor the theme of the suspect document is taken into account. Olson criticized some studies which convert language into mathematical equations (Herchey et al., 2007).

We choose hybridization to take advantage of both the stylistic and the statistical methods. On the one hand, we use lexical and syntactic analysis to address the problem of the mathematical representation of a text (Section 3.1). On the other hand, we apply the Delta rule to gather the writers who have almost the same style (Section 3.2).

3 The Proposed Method

This section describes our hybrid extrinsic method for author identification. First, we extract the different types of stylistic features (syntactic, lexical and character features) and then the n-grams. The second step, author selection, relies on the Delta method. The third step is the application of the learning model.

3.1 Feature Extraction

In order to extract features, also called style markers, we use the tools of the Apache Open Library. These robust tools allow segmenting the texts and analyzing the necessary syntax and semantics.

For the lexical features, obtained by frequency calculations, the text is regarded as a set of tokens. We compute the number of words that appear only once (hapaxes), the ratio V/N (where V is the number of hapaxes and N is the length of the text), the average sentence length and the number of words which appear twice.

Then, we extract the syntactic features, such as the number of nouns, verbs, adjectives, adverbs and prepositions.

For the character features, we consider the text as a simple sequence of characters. We take into account the frequencies of letters, punctuation marks (number of colons, exclamation marks, question marks and commas), uppercase and lowercase characters, as well as numerical and alphabetical characters.

Finally, we resort to character n-grams, with n varying from 3 to 7 characters. A small n (n = 3) captures syllables and punctuation marks, while a larger n captures whole words.

3.2 Authors Selection

In this step, we select authors in order to prepare the machine learning process. We apply the Delta method to the candidate document and to the existing documents of all authors. For each candidate document of unknown authorship, we select the three authors who have the lowest Delta measure with the candidate document.

We note that different verification problems (different folders) may share documents of the same authors. For example, the known document of folder EN001 and that of folder EN002 may be written by the same author.

Then, we calculate the distance based on the standardized frequencies (Z-scores) between two documents Q and A_j using the following equation:

$$ D(Q, A_j) = \frac{1}{m} \sum_{i=1}^{m} \left| \mathrm{Zscore}(t_{iq}) - \mathrm{Zscore}(t_{ij}) \right| $$

where

$$ \mathrm{Zscore}(t_{ij}) = \frac{\mathrm{tfr}_{ij} - \mathrm{mean}_i}{\mathrm{sd}_i} $$

Here tfr_ij is the frequency of the term t_i in the document D_j, mean_i is the average and sd_i the standard deviation of the frequency of t_i. Finally, we vary the number m of most frequent terms between 100 and 400 words.

3.3 Application of a Classification Model

We perform the machine learning process on the documents of the candidate author and those of the three already selected authors. We use the Weka tool to represent the known author and the other three authors by an ARFF file containing the already extracted features. We then apply a learning algorithm to this file in order to obtain a prediction model in which the known texts are the positive examples and the documents written by other authors are the negative examples. The algorithm is chosen after testing multiple classifiers, such as SVM, decision trees, Naive Bayes, decision table and KNN. We choose the algorithm that gives the best performance.

4 Basic Characteristics of Our Hybrid Method

Hybridization has always been considered an interesting track because it overcomes the limitations of the combined approaches. Table 1 presents a comparison between the different methods of author identification.

Verification model: The intrinsic models use only the texts within a verification problem (Zheng et al., 2006), (Feng et al., 2013), (Mosteller and Wallace, 1964). In other studies (Labbé, 2014), (Savoy, 2012), Labbé and Savoy consider other texts written by different authors and attempt to transform the verification task into a binary classification problem. According to PAN@CLEF 2014 and PAN@CLEF 2015, extrinsic models give better results than intrinsic ones (Stamatatos et al., 2014).

Classification: There are two methods of classification: eager methods, which use supervised learning (Zheng et al., 2006), (Feng et al., 2013), and lazy methods, which do not apply any learning algorithm (Mosteller and Wallace, 1964), (Labbé, 2003), (Savoy, 2012). In this paper, we resort to supervised learning using SVM.

Attribution paradigm: There are two attribution paradigms (Stamatatos et al., 2000). In the instance-based representation, each document is represented separately (Feng et al., 2013), (Labbé, 2003), (Savoy, 2012). The profile-based paradigm, in contrast, tries to construct an author profile using all texts of the corresponding author (Zheng et al., 2006), (Mosteller and Wallace, 1964). We choose a hybrid of the two paradigms: each document gets its own representation, and these representations are then combined into a single author profile.

Text analysis: Most of the proposed studies used part-of-speech (POS) tagging (Zheng et al., 2006), (Mosteller and Wallace, 1964) because of the availability of taggers. Some other studies resorted to intertextual distance (Labbé, 2003), (Savoy, 2012). Our method combines statistical and stylistic features (Sections 3.1 and 3.2).

Table 1: Author identification methods

| Author(s) | Verification model | Classification | Attribution paradigm | Text analysis |
| (Zheng, 2006) | extrinsic | eager | author profile | POS tagging |
| (Feng, 2013) | extrinsic | eager | instance based | POS tagging |
| (Wallace et al., 2011) | extrinsic | lazy | author profile | POS tagging |
| (Labbé, 2014) | intrinsic | lazy | instance based | intertextual distance |
| (Savoy et al., 2013) | intrinsic | lazy | instance based | Delta method |
| Our method | extrinsic | eager | hybrid | Delta method + POS tagging |

5 Experiments and Evaluation

In this section, we show the experimental results of our method for author identification. We first describe the corpus and the evaluation measures. Then, we report the performance of our system in the identification of anonymous authors.

5.1 Corpus

The training corpus includes a set of folders from PAN@CLEF 2014. Each folder contains up to five known documents and a test document in English. The length of the documents varies from a few hundred to a few thousand words. It is worth noting that the experiments were carried out on the 200 problems in the corpus.

5.2 Performance Measures

To assess our results, we adopt the c@1 measure (Penas and Rodrigo, 2011), AUC and recall.

Recall

In the context of classification tasks, the terms true positive, true negative, false positive and false negative are used to compare the predicted class of an item with its actual class:

TN / True Negative: case was negative and predicted negative
TP / True Positive: case was positive and predicted positive
FN / False Negative: case was positive but predicted negative
FP / False Positive: case was negative but predicted positive

$$ \mathrm{Recall} = \frac{TP}{TP + FN} $$

C@1 score

The c@1 score has the advantage of taking into account the documents that the classifier is unable to assign to a category. For each problem, a score greater than 0.5 is considered a positive response, while a score below 0.5 is considered a negative response, meaning that the test document does not belong to this author. Scores equal to 0.5 correspond to unanswered problems, where the answer is "I don't know". The c@1 measure is then defined as follows (Penas and Rodrigo, 2011):

$$ \mathrm{c@1} = \frac{1}{n} \left( n_c + \frac{n_u \, n_c}{n} \right) $$

where n is the number of problems, n_c is the number of correct answers and n_u is the number of unanswered problems.

AUC score

The AUC is a common evaluation metric for binary classification problems. Figure 1 presents an example of an AUC plot. Consider a plot of the true positive rate against the false positive rate as the threshold value for classifying an item as 0 or 1 is increased from 0 to 1. If the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. If the classifier is no better than random guessing, the true positive rate will increase linearly with the false positive rate and the area under the curve will be around 0.5.

Figure 1: Example of AUC plot

5.3 Result Analysis

The histograms below report the experiments conducted to obtain the best possible document paternity.

Figure 2(a) shows the accuracy reached by six well-known classifiers on a test set, in order to select the best one. This accuracy is determined with all the stylistic features and the n-gram features (n varying between 3 and 7). The best accuracy is achieved by the SVM algorithm, with a slight advantage over the Naive Bayes classifier. Figure 2(b) shows that the character features are not very powerful in determining the authors of documents whose origin is unknown. On the other hand, the syntactic features give encouraging results. Combining these features provides better performance than using each feature separately. Figure 2(c) depicts the c@1 histogram of the n-gram method. It shows that accuracy reaches a maximum for n = 3 and n = 4 and then decreases as n increases. After that, we use the m most frequent words (m between 100 and 400). Figure 2(d) shows that the best c@1 measure is obtained with the SVM algorithm and 250 words. This measure decreases as the number of words increases.

Figure 3 demonstrates that combining the syntactic features, the lexical ones and the 3-grams brings encouraging results in a machine learning process.

The measures reach high values with the choice of the 250 most frequent words. Our system has proven its effectiveness when the statistical and the stylistic analyses were combined. Thus, we were able to find the unknown author of a document in 59% of the studied cases. In Table 2, we compare the performance of our method with that of the winner of the PAN@CLEF 2014 competition for the English essays.

Table 2: Comparison between our performance and Frery et al. (2014)

| Measure | Baseline | Our method | Frery et al. (2014) |
| C@1 | 0.53 | 0.68 | 0.71 |
| Recall | 0.5 | 0.74 | 0.72 |
| AUC | 0.54 | 0.6 | 0.72 |

From Table 2, we notice that our method is useful in terms of recall, where it slightly outperforms Frery et al. (2014) (0.74 against 0.72), although C@1 and AUC still need to be improved. Compared with the results of the PAN@CLEF 2014 competition (Stamatatos et al., 2014), our classification results are encouraging, which shows the effectiveness of our method. In future work, we will try to improve our results by focusing on the attribute selection step.

6 Conclusion

In this paper, we have focused on the author identification problem by applying a machine learning process. The introduced hybrid method is essentially based on using both stylistic and statistical characteristics. The experimental results reveal the efficiency of the proposed technique, in which we use the Delta method prior to syntactic and lexical features as well as n-grams and character features. We have also shown through the experiments how the heterogeneous models allowed us to detect the document paternity appropriately. In future research, we will try to make our technique more effective by using a text extraction tool. The main objective will be to show that the author's style is clear in some specific parts of the written text. We are also planning to apply our approach on
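The author selection step of Section 3.2 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. It assumes that documents are already tokenized, that the m most frequent terms are taken over the candidate document plus all candidate authors' texts, and that the mean and standard deviation of each term are computed over those same documents. The function names (`relative_frequencies`, `delta_distances`, `select_closest_authors`) are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary term in one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return [counts[term] / total for term in vocabulary]


def delta_distances(query_tokens, author_tokens, m=250):
    """Return {author: D(Q, A_j)} over the m most frequent terms.

    Assumption: the vocabulary, mean_i and sd_i are computed over the
    query document plus all candidate authors' documents.
    """
    pooled = Counter(query_tokens)
    for tokens in author_tokens.values():
        pooled.update(tokens)
    vocabulary = [term for term, _ in pooled.most_common(m)]
    names = list(author_tokens)
    if not vocabulary:
        return {name: 0.0 for name in names}

    # Row 0 is the query Q; row k + 1 is the k-th candidate author.
    profiles = [relative_frequencies(query_tokens, vocabulary)]
    profiles += [relative_frequencies(author_tokens[n], vocabulary) for n in names]

    totals = [0.0] * len(names)
    for i in range(len(vocabulary)):
        column = [profile[i] for profile in profiles]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # term used identically everywhere: contributes nothing
        z = [(x - mu) / sd for x in column]
        for k in range(len(names)):
            totals[k] += abs(z[0] - z[k + 1])
    return {name: totals[k] / len(vocabulary) for k, name in enumerate(names)}


def select_closest_authors(query_tokens, author_tokens, k=3, m=250):
    """Names of the k authors with the lowest Delta to the query."""
    distances = delta_distances(query_tokens, author_tokens, m)
    return sorted(distances, key=distances.get)[:k]
```

With `k=3`, `select_closest_authors` returns the three authors whose documents would then be used as negative examples in the classification step of Section 3.3.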
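The c@1 and recall measures of Section 5.2 can be computed from per-problem scores as in the short Python sketch below. The 0.5 threshold and the c@1 formula come from the text. Treating an unanswered positive problem (score exactly 0.5) as a false negative in the recall is an assumption of this sketch, and the function names are hypothetical.

```python
def c_at_1(scores, truths, threshold=0.5):
    """c@1 = (1/n) * (nc + nu * nc / n), with nu = unanswered problems."""
    n = len(scores)
    if n == 0:
        raise ValueError("at least one problem is required")
    n_correct = 0
    n_unanswered = 0
    for score, truth in zip(scores, truths):
        if score == threshold:
            n_unanswered += 1  # "I don't know"
        elif (score > threshold) == truth:
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


def recall(scores, truths, threshold=0.5):
    """Recall = TP / (TP + FN).

    Assumption: a positive problem left unanswered counts as a false negative.
    """
    tp = sum(1 for s, t in zip(scores, truths) if t and s > threshold)
    fn = sum(1 for s, t in zip(scores, truths) if t and s <= threshold)
    return tp / (tp + fn) if tp + fn else 0.0
```

For example, with scores `[0.9, 0.2, 0.5, 0.4]` and ground truth `[True, False, True, True]`, two answers are correct and one problem is unanswered, so c@1 = (2 + 1 * 2 / 4) / 4 = 0.625.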
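The lexical features and character n-grams of Section 3.1 can be sketched in plain Python as follows. This is only an illustration under simple assumptions: tokens are runs of letters and apostrophes, sentences are split on ".", "!" and "?", and no external NLP toolkit is used, whereas the paper relies on dedicated tools for segmentation. The function names and dictionary keys are hypothetical.

```python
import re
from collections import Counter


def lexical_features(text):
    """Hapax count, V/N ratio, count of words seen twice, mean sentence length."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    twice = sum(1 for c in counts.values() if c == 2)
    n = len(tokens)
    return {
        "hapax_count": hapaxes,
        "hapax_ratio": hapaxes / n if n else 0.0,  # V/N
        "twice_count": twice,
        "avg_sentence_length": n / len(sentences) if sentences else 0.0,
    }


def char_ngrams(text, n_min=3, n_max=7):
    """Counts of all character n-grams for n_min <= n <= n_max."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams
```

The resulting values can be concatenated into one feature vector per document before being written to the ARFF file mentioned in Section 3.3.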