-

Authorship Verification via k-Nearest Neighbor Estimation

0 Fraunhofer Institute for Secure Information Technology SIT Rheinstrasse 75 , 64295 Darmstadt , Germany 1 Oren Halvani , Martin Steinebach, and Ralf Zimmermann

2013

In this paper we describe our k-Nearest Neighbor (k-NN) based Authorship Verification method for the Author Identification (AI) task of the PAN 2013 challenge. The method follows an ensemble classification technique based on the combination of suitable feature categories. For each chosen feature category we apply a k-NN classifier to calculate a style deviation score between the training documents of the true author A and the document from an author, who claims to be A. Depending on the score and a given threshold, a decision for or against the alleged author is generated and stored into a list. Afterwards, the final decision regarding the alleged authorship is determined through a majority vote among all decisions within this list. The method provides a number of benefits as for instance the independence of linguistic resources like ontologies, thesauruses or even language models. A further benefit is the language-independency among different Indo-European languages as the approach is applicable on languages like Spanish, English, Greek or German. Another benefit is the low runtime of the method, since there is no need for deep linguistic processing like POS-tagging, chunking or parsing. Moreover, the method can be extended or modified for instance by replacing the classification function, the threshold or the underlying features including their parameters (e.g. n-Gram sizes or the amount of feature frequencies). In addition to the PAN 2013 AI-training-corpus, where we gained an overall accuracy score of 80%, we also evaluated the algorithm on our own dataset with an accuracy of 77.50%.

Authorship Verification (AV ) is a subdiscipline of Authorship Analysis [8, Page: 3], where the task is to verify if a given document DA? of an alleged author A has been really written by A or not. In order to verify the authorship of DA? some components are mandatorily required. Generally, this includes a set of sample documents from A, a set of features (concrete "style markers") and at least one classification method that makes the decision regarding the alleged authorship. From the perspective of Machine Learning, AV appears to be an instance of binary classification problems, since the expected output of AV system regarding the alleged authorship is either "yes" or "no" (formally A or :A). On a second glance, however, it turns out that AV must describe an one-class-classification problem, [3] since A is the only target class to be distinguished among all other classes (authors), which can be theoretically infinite.

In contrast to the related discipline Authorship Attribution (AA), which is also a subfield of Authorship Analysis, AV has not gained the same popularity in research. This statement can be observed from the number of publications within scientific gateways/portals as for instance ACM, CiteCeer or Springer in Table 1.

Portal Number of publications according to both disciplines Source

ACM Digital Library 50 for AV vs. 391 for AA [4] CiteCeerX 27 for AV vs. 377 for AA [1] SpringerLink 25 for AV vs. 267 for AA [7]

One possible reason for this is that AV can be more complicated to handle as AA, since it represents an open-class problem by default, while AA can be both an open or a closed-set problem. Open-set means that the true author of a given document is not included in a set, while in a closed-set the opposite is the case. Upendra et al. mentioned in [6] that most of the research focusses on the closed-set case, which is easier to solve as the open-set case problem. The latter one, however, is closely related to AV , since a decision must be formulated that descibes if an alleged author is accepted to be the true author or not.

In this paper we propose an intuitive scheme for AV , based on our previous work in [2, Page: 99]. The core of the method relies on the Nearest Neighbor one-classclassification technique described in [9, Page: 69]. The main idea of our approach is to compute and compare style deviation scores depending on so-called feature category combinations between DA? and a set of training documents of A. Afterwards the overall decision regarding the alleged authorship is determined, based on a majority vote among all single decisions (according to the chosen feature category combinations). 2

Notation

To increase the readability of this paper we decided to follow the notation in Table 2, which we have also defined in our previous work [2]. The notation will be used both in tables and in continuous text.

Symbol Decription

A Denotes the true author.

DA? Denotes a document of an alleged author, who claims to be A.

DA Denotes a set f D1A; D2A; : : : g containing sample documents of A. F Denotes a feature category consisting of f f1; f2; : : : g features. F Denotes a feature vector (e.g. FA? as the feature vector represenatation of DA?).

Features

In this section we describe which features we used in our AV system for the PAN task. Typically features are taken from linguistic layers, which can be understood as structured units of an unstructured text. Stamatatos [8] and Loose [5] for instance mention the following linguistic layers, which can be looked-up for features: 1. Phoneme layer: This layer includes phoneme-based features that can be won out of texts by using (pronouncing) dictionaries, e.g. IPA. 2. Character layer: This layer includes character-based features as for instance prefixes, suffixes or letter n-Grams. 3. Lexical layer: This layer includes token-based features as for instance function words or Pos-Tags (Part-Of-Speech Tags). 4. Syntactic layer: This layer includes syntax-based features as for instance constituents (e.g. nominal phrases) or collocations. 5. Semantic Layer: This layer includes semantic-based features especially semantic relations (homonyms, hyponymous, synonymys, meronyms, etc. ).

Instead of utilizing features according to these layers, we decided to define the term "feature categories" that allows us to distinguish more precisely among single features. For example, considering the features a,o and u, the appropriate feature category would be letters. Hence, a feature category can be seen as concrete concept of features belonging to a specific (more abstract) linguistic layer. Table 3 lists all feature categories that have been used by our AV system.

Fi Feature category Examples F1 Punctuation marks -, _, ,, ., :, ;, (), [], {} F2 Letters a, b, c, : : : , x, y, z, A, B, C, : : : , X, Y, Z F3 Letter n-Grams en, er, th, ted, ough F4 Token k-prefixes [removed] [re], [confirmed] F5 Token k-suffixes [extended] [ed], [available] F6 Function words and, or, the, on, in, while F7 Function word n-Grams (which, is, or), (that, on, the, above) F8 Sentence k-beginning function words (The : : : ), (Since the : : : ) F9 Token n-Grams (such that), (it could not) F10 Token n-Gram lengths (of the) ( 2,3 ), (are known as) F11 Token n-Gram k-prefixes (has been more) (ha, be, mo) F12 Token n-Gram k-suffixes (has been more) (as, en, re) Remarks n 2 f2; 3; 4g k 2 f2; 3g k 2 f2; 4g k 2 f3; 4g k 2 f1; 2g k 2 f2; 3g ( 3,5,2 ) k 2 f2; 4g n = 3; k = 2 n = 3; k = 2 [con] [able] The majority of the feature categories in Table 3 can be parameterized in various ways, e.g. n-Gram sizes, k-prefix/suffix lengths or the number of entries within dictionarybased feature categories (for instance F6). Moreover, the frequencies of the extracted features for each Fi can be adjusted, such that one can decide to use all or just the top t extracted features. As a consequence, there is a dependency of these parameters when constructing feature vectors, which also affects our AV system regarding accuracy and runtime. For the PAN competition, however, we could not find the best parameters, due to the fact that the parameter-space is extremely huge and can only be searched through by special techniques as for instance evolutionary algorithms. Such techniques, however, are beyond the scope of our approach and are subject of future work. 4

Verification via k-NN Estimation

In this section we give a detailed description of our AV system. For overview purposes we divided the entire process into three subsections, where we first explain what kind of preprocessing we perform to clean and normalize the data, then we describe the core verification algorithm and finally we show how our system determines the decision regarding the alleged authorship. 4.1

Preprocessing

Given DA? as an input document which is going to be verified regarding the alleged authorship and DA as the set of sample documents of the true author, the first step is to perform preprocessing on all documents in terms of normalization and noise reduction. Normalization is essential to treat all documents uniquely, while noise reduction is important to increase the quality of extracted features. We normalize the documents by several operations as for instance substituting successive blanks by only one or by replacing diacritics by their appropriate representatives, e.g. "ñ" "n". Moreover, we equalize the lengths of the training documents. For this we concatenate all DiA 2 DA into a single document, which is then splitted up into ` + 1 (near) equal-sized training documents D1A; D2A; : : : ; D`+1A. Regarding noise reduction we remove citations, markup tags, formulas and non-words as for instance "DFT-Eq.(6.1)". 4.2

Verification

Our verification algorithm is based on a k-NN classifier which, by its nature, is only able to work on numeric values. Therefore, we first must construct the feature vector representations FA? from DA? and F1A; F2A; : : : ; F`+1A from all the training documents. Beforehand, we have to choose a set of feature categories that should be taken into account. This set is denoted by F and is always odd-numbered, otherwise the later majority-vote determination would be inapplicable. Each feature vector consists of exactly n = jFij values that represent relative frequencies within its underlying document, according to Fi 2 F. It should be noted that in our scheme it can never be the case that a feature vector is mixed (includes features among more than one category). After all feature vectors have been constructed, we calculate style deviation scores between FA? and each Fj A for all Fi 2 F. A style deviation score describes a value sj 2 R +[f0g and is calculated through a distance function dist( ; ) as for example the euclidean distance. Hence, the deviation score can be written formally as sj = dist(FA?; Fj ). A The lower sj is, the lower is the style deviation between FA? and Fj A. Therefore sj = 0 is interpretated as absolutely identical, while sj ! 1 is interpretated as totally different in terms of style. The resulting deviation scores are stored together with their corresponding feature vectors into a list, which is sorted (according to the scores) by ascending order:

Outer_Distances = (s1; F1A); (s2; F2A); : : : ; (s`+1; F`+1A)

Next, we pick s1 (which represents the smallest style deviation) and its corresponding feature vector F1A from the first tuple within Outer_Distances and discard the remaining list. After that, we calculate again style deviation scores but this time between F1A and each FiA 2 f F2A; F3A; : : : ; F`+1A g. The scores sj0 = dist(F1A; FiA) are then stored (without their feature vectors) into the following list:

Inner_Distances = s20 ; s30 ; : : : ; s`0+1 4.3

Decision

To obtain a single decision (A or :A) regarding a chosen feature category Fi we first calculate the average of the k scores within Inner_Distances and denote it as avg_dist. Here, k refers to the k nearest neighbours of F1A. Then, we define the acceptance decision as follows:

s1 avg_dist

If the acceptance decision holds such that the threshold is not exceeded, A is accepted as the true author of DA?, otherwise not. It should be noted that in all our experiments we chose = 1, since we have observed quite stable results with it in comparison with other thresholds. After all the decisions concerning F have been determined, we store them into a list denoted by Decisions, which can look like this, for example:

Decisions =

A; A; A; :A; :A

Finally we apply a majority-vote over Decisions and treat the resulting outcome as the overall decision. In the case of the above example the AV system will judge that DA? has been written by A. 5

Experiments

In our experiments we consider the PAN corpus ("PAN 2013 AI-training-corpus"), which has been provided for the Author Identification task. It consists of 189 training documents across the languages Greek (GR), English (EN) and Spanish (SP) and is divided into the datasets jCGRj = 20, jCEN j = 10 and jCSP j = 5.

For our AV system we tested ten different combinations of feature categories and the parameters listed in Table 4. The results are calculated in simple and weighted accuracies, where a simple accuracy is calculated as: ? = ?CGR + ?CEN + : : : jCGR [ CEN [ : : : j

Number of correct answers per dataset Ci , with ?Ci = Total number of documents per dataset Ci while in contrast, a weighted accuracy is calculated as: (weighted)? = jCGRj ?CGR + jCEN j ?CEN + : : :

jCGR [ CEN [ : : : j

It should be highlighted that, due to the low accuracies of some feature categories, not all Fi appear within the feature category combinations in Table 5.

F f F1; F3; F9 g f F1; F3; F7; F8; F12 g f F1; F2; F3 g f F1; F4; F9 g f F1; F3; F9; F11; F12 g 80 % f F7; F9; F11 g 60 % f F3; F6; F7; F11; F12 g 60 % f F2; F5; F6 g 80 % f F3; F7; F9 g 20 % f F4; F6; F7 g 40 % ?CSP ?CEN ?CGR 80 % 90 % 70 % 80 % 80 % 80 % 65 % 75 % 80 % 80 % 55 % 71:67 % 80 % 80 % 60 % 73:33 % 80 % 55 % 71:67 % 60 % 50 % 56:67 % 50 % 55 % 55 % 40 % 40 % 53:33 % 70 % 50 % 46:67 % 40 % 60 % 46:67 % ? (weighted) ? 77:14 % 71:42 % 65:71 % 68:57 % 65:71 % 54:28 % 54:28 % 45:71 % 51:43 % 51:43 %

In order to test the language independence of our method, we decided to extend the PAN corpus with an additional german dataset denoted by CDE . The CDE dataset consists of 400 documents, extracted from various German-speaking e-Books from the publicly available "Online Book Catalog" of the Gutenberg Project website: CDE includes 40 problem-cases, where each case includes nine sample documents of A and one document from an author, who claims to be A. The average size of one document is 18 KByte, where each document comprises 4KByte of noisy data. Table 6 summarizes the results according to the extended PAN corpus.

F ?CSP ?CEN ?CGR ?CDE f F1; F3; F9 g 80 % 90 % 70 % 67:5 % 76:86 % f F1; F3; F7; F8; F12 g 80 % 80 % 65 % 77:5 % 75:63 % f F1; F2; F3 g 80 % 80 % 55 % 75 % 72:5 % f F1; F4; F9 g 80 % 80 % 60 % 62:5 % 70:63 % f F1; F3; F9; F11; F12 g 80 % 80 % 55 % 62:5 % 69:38 % f F7; F9; F11 g 60 % 60 % 50 % 60 % 57:5 % f F3; F6; F7; F11; F12 g 60 % 50 % 55 % 62:5 % 56:88 % f F2; F5; F6 g 80 % 40 % 40 % 65 % 56:26 % f F3; F7; F9 g 20 % 70 % 50 % 67:5 % 51:86 % f F4; F6; F7 g 40 % 40 % 60 % 60 % 50 %

In our experiments we observed several phenomenons, where the most important one is that the parameter settings seems to have a very strong influence on the results. In Table 7 we list alternative parameter settings for the best feature category combination F = f F1; F3; F9 g regarding the PAN corpus (without CDE ).

Conclusion & future work

In this paper we presented an intuitive scheme to automatically verify alleged authorships of text documents. Our method provides several benefits as for instance language independence (at least for Indo-European languages) and also independence of linguistic resources (e.g. ontologies, thesauruses, language models, etc.). A further benefit is the low runtime of the method, since there is no need for deep linguistic processing like POS-tagging, chunking or parsing. Another benefit is that the involved components within the method can be replaced easily as for example the style deviation method, the acceptance-threshold or the feature categories including their parameters. Moreover, the components can be extended or combined e.g. through ensemble-techniques (combination of several style deviation methods).

Unfortunately, besides benefits our approach includes several pitfalls, too. One of the biggest challenges, for example, is the inscrutability of the methods parameterspace, due to the fact that the number of configuration settings is near infinite. Such settings include for instance the number of extracted features from a specific category or feature-specific parameters (e.g. n-Gram sizes, k-prefix/suffix lengths, etc.). Therefore, it must be investigated how to find the very best corpus-independent parameter settings. This could be accomplished, for instance, with evolutionary algorithms.

Another challenge that remains unsolved is the influence of the text topic, which can distort the verification results. For some feature categories, e.g. F3 (Letter n-Grams), we encountered the presence of content words which are mostly topic related. Yet it is not clear if the verification results remains the same if training documents derive from different sources (e.g. e-Mails, scientific papers, social networks messages, movie reviews, etc.), which may differ heavily in register, genre and style. Hence, a more detailed investigation is required in the near future, which involves special corpora construction. 7

Acknowledgements

This work was supported by the CASED Center for Advanced Security Research Darmstadt, Germany funded by the German state government of Hesse under the LOEWE programme (http://www.CASED.de). 4. Library, T.A.D.: Association for Computing Machinery, New York, NY, USA (2013), http://dl.acm.org 5. Loose, F.: Paarweise Autorenschaftsverifikation von kurzen Texten. Master thesis,

Bauhaus-Universität Weimar, Fakultät Medien, Medieninformatik (May 2011) 6. Sapkota, U., Solorio, T.: Sub-Profiling by Linguistic Dimensions to Solve the Authorship Attribution Task. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) CLEF (Online Working Notes/Labs/Workshop) (2012), http://dblp.uni-trier.de/db/conf/clef/clef2012w.html#SapkotaS12 7. SpringerLink: Springer-Verlag GmbH, Heidelberg, Germany (2013),

http://www.springerlink.com 8. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci.

Technol. 60( 3 ), 538–556 (Mar 2009), http://dx.doi.org/10.1002/asi.v60:3 9. Tax, D.M.J.: One-Class Classification. Concept Learning In the Absence of Counter-Examples. Ph.D. thesis, Delft University of Technology (2001), http://www-ict.ewi.tudelft.nl/˜davidt/thesis.pdf

1. CiteSeer: The College of Information Sciences and Technology , The Pennsylvania State University, USA ( 2013 ), http://citeseerx.ist.psu.edu/index

2. Halvani , O. : Autorschaftsanalyse im Kontext der Attribution, Verifikation und intrinsischer Exploration . Master thesis ( 2012 )

3. Koppel , M. , Schler , J.: Authorship Verification as a One-Class Classification Problem . In: Proceedings of the twenty-first international conference on Machine learning . pp. 62 -. ICML '04 , ACM , New York, NY, USA ( 2004 ), http://doi.acm. org/10 .1145/1015330.1015448