<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Authorship Verification via k-Nearest Neighbor Estimation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT Rheinstrasse 75</institution>
          ,
          <addr-line>64295 Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oren Halvani</institution>
          ,
          <addr-line>Martin Steinebach, and Ralf Zimmermann</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper we describe our k-Nearest Neighbor (k-NN) based Authorship Verification method for the Author Identification (AI) task of the PAN 2013 challenge. The method follows an ensemble classification technique based on the combination of suitable feature categories. For each chosen feature category we apply a k-NN classifier to calculate a style deviation score between the training documents of the true author A and the document from an author, who claims to be A. Depending on the score and a given threshold, a decision for or against the alleged author is generated and stored into a list. Afterwards, the final decision regarding the alleged authorship is determined through a majority vote among all decisions within this list. The method provides a number of benefits as for instance the independence of linguistic resources like ontologies, thesauruses or even language models. A further benefit is the language-independency among different Indo-European languages as the approach is applicable on languages like Spanish, English, Greek or German. Another benefit is the low runtime of the method, since there is no need for deep linguistic processing like POS-tagging, chunking or parsing. Moreover, the method can be extended or modified for instance by replacing the classification function, the threshold or the underlying features including their parameters (e.g. n-Gram sizes or the amount of feature frequencies). In addition to the PAN 2013 AI-training-corpus, where we gained an overall accuracy score of 80%, we also evaluated the algorithm on our own dataset with an accuracy of 77.50%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Authorship Verification (AV ) is a subdiscipline of Authorship Analysis [8, Page: 3],
where the task is to verify if a given document DA? of an alleged author A has been
really written by A or not. In order to verify the authorship of DA? some components
are mandatorily required. Generally, this includes a set of sample documents from A,
a set of features (concrete "style markers") and at least one classification method that
makes the decision regarding the alleged authorship. From the perspective of Machine
Learning, AV appears to be an instance of binary classification problems, since the
expected output of AV system regarding the alleged authorship is either "yes" or "no"
(formally A or :A). On a second glance, however, it turns out that AV must describe an
one-class-classification problem, [3] since A is the only target class to be distinguished
among all other classes (authors), which can be theoretically infinite.</p>
      <p>In contrast to the related discipline Authorship Attribution (AA), which is also a
subfield of Authorship Analysis, AV has not gained the same popularity in research.
This statement can be observed from the number of publications within scientific
gateways/portals as for instance ACM, CiteCeer or Springer in Table 1.</p>
    </sec>
    <sec id="sec-2">
      <title>Portal Number of publications according to both disciplines Source</title>
      <p>ACM Digital Library 50 for AV vs. 391 for AA [4]
CiteCeerX 27 for AV vs. 377 for AA [1]
SpringerLink 25 for AV vs. 267 for AA [7]</p>
      <p>One possible reason for this is that AV can be more complicated to handle as AA,
since it represents an open-class problem by default, while AA can be both an open or
a closed-set problem. Open-set means that the true author of a given document is not
included in a set, while in a closed-set the opposite is the case. Upendra et al. mentioned
in [6] that most of the research focusses on the closed-set case, which is easier to solve
as the open-set case problem. The latter one, however, is closely related to AV , since a
decision must be formulated that descibes if an alleged author is accepted to be the true
author or not.</p>
      <p>In this paper we propose an intuitive scheme for AV , based on our previous work
in [2, Page: 99]. The core of the method relies on the Nearest Neighbor
one-classclassification technique described in [9, Page: 69]. The main idea of our approach is
to compute and compare style deviation scores depending on so-called feature category
combinations between DA? and a set of training documents of A. Afterwards the overall
decision regarding the alleged authorship is determined, based on a majority vote among
all single decisions (according to the chosen feature category combinations).
2</p>
      <sec id="sec-2-1">
        <title>Notation</title>
        <p>To increase the readability of this paper we decided to follow the notation in Table 2,
which we have also defined in our previous work [2]. The notation will be used both in
tables and in continuous text.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Symbol Decription</title>
      <p>A Denotes the true author.</p>
      <p>DA? Denotes a document of an alleged author, who claims to be A.</p>
      <p>DA Denotes a set f D1A; D2A; : : : g containing sample documents of A.
F Denotes a feature category consisting of f f1; f2; : : : g features.
F Denotes a feature vector (e.g. FA? as the feature vector represenatation of DA?).</p>
      <sec id="sec-3-1">
        <title>Features</title>
        <p>In this section we describe which features we used in our AV system for the PAN task.
Typically features are taken from linguistic layers, which can be understood as
structured units of an unstructured text. Stamatatos [8] and Loose [5] for instance mention
the following linguistic layers, which can be looked-up for features:
1. Phoneme layer: This layer includes phoneme-based features that can be won out
of texts by using (pronouncing) dictionaries, e.g. IPA.
2. Character layer: This layer includes character-based features as for instance
prefixes, suffixes or letter n-Grams.
3. Lexical layer: This layer includes token-based features as for instance function
words or Pos-Tags (Part-Of-Speech Tags).
4. Syntactic layer: This layer includes syntax-based features as for instance
constituents (e.g. nominal phrases) or collocations.
5. Semantic Layer: This layer includes semantic-based features especially semantic
relations (homonyms, hyponymous, synonymys, meronyms, etc. ).</p>
        <p>Instead of utilizing features according to these layers, we decided to define the term
"feature categories" that allows us to distinguish more precisely among single features.
For example, considering the features a,o and u, the appropriate feature category would
be letters. Hence, a feature category can be seen as concrete concept of features
belonging to a specific (more abstract) linguistic layer. Table 3 lists all feature categories that
have been used by our AV system.</p>
        <p>
          Fi Feature category Examples
F1 Punctuation marks -, _, ,, ., :, ;, (), [], {}
F2 Letters a, b, c, : : : , x, y, z, A, B, C, : : : , X, Y, Z
F3 Letter n-Grams en, er, th, ted, ough
F4 Token k-prefixes [removed] [re], [confirmed]
F5 Token k-suffixes [extended] [ed], [available]
F6 Function words and, or, the, on, in, while
F7 Function word n-Grams (which, is, or), (that, on, the, above)
F8 Sentence k-beginning function words (The : : : ), (Since the : : : )
F9 Token n-Grams (such that), (it could not)
F10 Token n-Gram lengths (of the) (
          <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
          ), (are known as)
F11 Token n-Gram k-prefixes (has been more) (ha, be, mo)
F12 Token n-Gram k-suffixes (has been more) (as, en, re)
Remarks
n 2 f2; 3; 4g
k 2 f2; 3g
k 2 f2; 4g
k 2 f3; 4g
k 2 f1; 2g
k 2 f2; 3g
(
          <xref ref-type="bibr" rid="ref2 ref3">3,5,2</xref>
          ) k 2 f2; 4g
n = 3; k = 2
n = 3; k = 2
[con]
[able]
The majority of the feature categories in Table 3 can be parameterized in various ways,
e.g. n-Gram sizes, k-prefix/suffix lengths or the number of entries within
dictionarybased feature categories (for instance F6). Moreover, the frequencies of the extracted
features for each Fi can be adjusted, such that one can decide to use all or just the top t
extracted features. As a consequence, there is a dependency of these parameters when
constructing feature vectors, which also affects our AV system regarding accuracy and
runtime. For the PAN competition, however, we could not find the best parameters,
due to the fact that the parameter-space is extremely huge and can only be searched
through by special techniques as for instance evolutionary algorithms. Such techniques,
however, are beyond the scope of our approach and are subject of future work.
4
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Verification via k-NN Estimation</title>
        <p>In this section we give a detailed description of our AV system. For overview purposes
we divided the entire process into three subsections, where we first explain what kind
of preprocessing we perform to clean and normalize the data, then we describe the core
verification algorithm and finally we show how our system determines the decision
regarding the alleged authorship.
4.1</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Preprocessing</title>
      <p>Given DA? as an input document which is going to be verified regarding the alleged
authorship and DA as the set of sample documents of the true author, the first step is
to perform preprocessing on all documents in terms of normalization and noise
reduction. Normalization is essential to treat all documents uniquely, while noise reduction
is important to increase the quality of extracted features. We normalize the documents
by several operations as for instance substituting successive blanks by only one or by
replacing diacritics by their appropriate representatives, e.g. "ñ" "n". Moreover, we
equalize the lengths of the training documents. For this we concatenate all DiA 2 DA
into a single document, which is then splitted up into ` + 1 (near) equal-sized training
documents D1A; D2A; : : : ; D`+1A. Regarding noise reduction we remove citations,
markup tags, formulas and non-words as for instance "DFT-Eq.(6.1)".
4.2</p>
    </sec>
    <sec id="sec-5">
      <title>Verification</title>
      <p>Our verification algorithm is based on a k-NN classifier which, by its nature, is only
able to work on numeric values. Therefore, we first must construct the feature vector
representations FA? from DA? and F1A; F2A; : : : ; F`+1A from all the training
documents. Beforehand, we have to choose a set of feature categories that should be taken
into account. This set is denoted by F and is always odd-numbered, otherwise the later
majority-vote determination would be inapplicable. Each feature vector consists of
exactly n = jFij values that represent relative frequencies within its underlying document,
according to Fi 2 F. It should be noted that in our scheme it can never be the case that
a feature vector is mixed (includes features among more than one category). After all
feature vectors have been constructed, we calculate style deviation scores between FA?
and each Fj A for all Fi 2 F. A style deviation score describes a value sj 2 R +[f0g
and is calculated through a distance function dist( ; ) as for example the euclidean
distance. Hence, the deviation score can be written formally as sj = dist(FA?; Fj ).
A
The lower sj is, the lower is the style deviation between FA? and Fj A. Therefore
sj = 0 is interpretated as absolutely identical, while sj ! 1 is interpretated as totally
different in terms of style. The resulting deviation scores are stored together with their
corresponding feature vectors into a list, which is sorted (according to the scores) by
ascending order:</p>
      <p>Outer_Distances = (s1; F1A); (s2; F2A); : : : ; (s`+1; F`+1A)</p>
      <p>Next, we pick s1 (which represents the smallest style deviation) and its
corresponding feature vector F1A from the first tuple within Outer_Distances and discard the
remaining list. After that, we calculate again style deviation scores but this time between
F1A and each FiA 2 f F2A; F3A; : : : ; F`+1A g. The scores sj0 = dist(F1A; FiA) are
then stored (without their feature vectors) into the following list:</p>
      <p>Inner_Distances =
s20 ; s30 ; : : : ; s`0+1
4.3</p>
    </sec>
    <sec id="sec-6">
      <title>Decision</title>
      <p>To obtain a single decision (A or :A) regarding a chosen feature category Fi we
first calculate the average of the k scores within Inner_Distances and denote it as
avg_dist. Here, k refers to the k nearest neighbours of F1A. Then, we define the
acceptance decision as follows:</p>
      <p>s1
avg_dist</p>
      <p>If the acceptance decision holds such that the threshold is not exceeded, A is
accepted as the true author of DA?, otherwise not. It should be noted that in all our
experiments we chose = 1, since we have observed quite stable results with it in
comparison with other thresholds. After all the decisions concerning F have been
determined, we store them into a list denoted by Decisions, which can look like this, for
example:</p>
      <p>Decisions =</p>
      <p>A; A; A; :A; :A</p>
      <p>Finally we apply a majority-vote over Decisions and treat the resulting outcome
as the overall decision. In the case of the above example the AV system will judge that
DA? has been written by A.
5</p>
      <sec id="sec-6-1">
        <title>Experiments</title>
        <p>In our experiments we consider the PAN corpus ("PAN 2013 AI-training-corpus"),
which has been provided for the Author Identification task. It consists of 189
training documents across the languages Greek (GR), English (EN) and Spanish (SP) and is
divided into the datasets jCGRj = 20, jCEN j = 10 and jCSP j = 5.</p>
        <p>For our AV system we tested ten different combinations of feature categories and
the parameters listed in Table 4. The results are calculated in simple and weighted
accuracies, where a simple accuracy is calculated as:
? =
?CGR + ?CEN + : : :
jCGR [ CEN [ : : : j</p>
        <p>Number of correct answers per dataset Ci
, with ?Ci = Total number of documents per dataset Ci
while in contrast, a weighted accuracy is calculated as:
(weighted)? = jCGRj ?CGR + jCEN j ?CEN + : : :</p>
        <p>jCGR [ CEN [ : : : j</p>
        <p>It should be highlighted that, due to the low accuracies of some feature categories,
not all Fi appear within the feature category combinations in Table 5.</p>
        <p>F
f F1; F3; F9 g
f F1; F3; F7; F8; F12 g
f F1; F2; F3 g
f F1; F4; F9 g
f F1; F3; F9; F11; F12 g 80 %
f F7; F9; F11 g 60 %
f F3; F6; F7; F11; F12 g 60 %
f F2; F5; F6 g 80 %
f F3; F7; F9 g 20 %
f F4; F6; F7 g 40 %
?CSP ?CEN ?CGR
80 % 90 % 70 % 80 %
80 % 80 % 65 % 75 %
80 % 80 % 55 % 71:67 %
80 % 80 % 60 % 73:33 %
80 % 55 % 71:67 %
60 % 50 % 56:67 %
50 % 55 % 55 %
40 % 40 % 53:33 %
70 % 50 % 46:67 %
40 % 60 % 46:67 %
? (weighted) ?
77:14 %
71:42 %
65:71 %
68:57 %
65:71 %
54:28 %
54:28 %
45:71 %
51:43 %
51:43 %</p>
        <p>In order to test the language independence of our method, we decided to extend
the PAN corpus with an additional german dataset denoted by CDE . The CDE dataset
consists of 400 documents, extracted from various German-speaking e-Books from the
publicly available "Online Book Catalog" of the Gutenberg Project website:
CDE includes 40 problem-cases, where each case includes nine sample documents of
A and one document from an author, who claims to be A. The average size of one
document is 18 KByte, where each document comprises 4KByte of noisy data. Table
6 summarizes the results according to the extended PAN corpus.</p>
        <p>F ?CSP ?CEN ?CGR ?CDE
f F1; F3; F9 g 80 % 90 % 70 % 67:5 % 76:86 %
f F1; F3; F7; F8; F12 g 80 % 80 % 65 % 77:5 % 75:63 %
f F1; F2; F3 g 80 % 80 % 55 % 75 % 72:5 %
f F1; F4; F9 g 80 % 80 % 60 % 62:5 % 70:63 %
f F1; F3; F9; F11; F12 g 80 % 80 % 55 % 62:5 % 69:38 %
f F7; F9; F11 g 60 % 60 % 50 % 60 % 57:5 %
f F3; F6; F7; F11; F12 g 60 % 50 % 55 % 62:5 % 56:88 %
f F2; F5; F6 g 80 % 40 % 40 % 65 % 56:26 %
f F3; F7; F9 g 20 % 70 % 50 % 67:5 % 51:86 %
f F4; F6; F7 g 40 % 40 % 60 % 60 % 50 %</p>
        <p>In our experiments we observed several phenomenons, where the most important
one is that the parameter settings seems to have a very strong influence on the results. In
Table 7 we list alternative parameter settings for the best feature category combination
F = f F1; F3; F9 g regarding the PAN corpus (without CDE ).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Conclusion &amp; future work</title>
        <p>In this paper we presented an intuitive scheme to automatically verify alleged
authorships of text documents. Our method provides several benefits as for instance language
independence (at least for Indo-European languages) and also independence of
linguistic resources (e.g. ontologies, thesauruses, language models, etc.). A further benefit is
the low runtime of the method, since there is no need for deep linguistic processing like
POS-tagging, chunking or parsing. Another benefit is that the involved components
within the method can be replaced easily as for example the style deviation method,
the acceptance-threshold or the feature categories including their parameters.
Moreover, the components can be extended or combined e.g. through ensemble-techniques
(combination of several style deviation methods).</p>
        <p>Unfortunately, besides benefits our approach includes several pitfalls, too. One of
the biggest challenges, for example, is the inscrutability of the methods
parameterspace, due to the fact that the number of configuration settings is near infinite. Such
settings include for instance the number of extracted features from a specific category
or feature-specific parameters (e.g. n-Gram sizes, k-prefix/suffix lengths, etc.).
Therefore, it must be investigated how to find the very best corpus-independent parameter
settings. This could be accomplished, for instance, with evolutionary algorithms.</p>
        <p>Another challenge that remains unsolved is the influence of the text topic, which can
distort the verification results. For some feature categories, e.g. F3 (Letter n-Grams), we
encountered the presence of content words which are mostly topic related. Yet it is not
clear if the verification results remains the same if training documents derive from
different sources (e.g. e-Mails, scientific papers, social networks messages, movie reviews,
etc.), which may differ heavily in register, genre and style. Hence, a more detailed
investigation is required in the near future, which involves special corpora construction.
7</p>
      </sec>
      <sec id="sec-6-3">
        <title>Acknowledgements</title>
        <p>This work was supported by the CASED Center for Advanced Security Research
Darmstadt, Germany funded by the German state government of Hesse under the LOEWE
programme (http://www.CASED.de).
4. Library, T.A.D.: Association for Computing Machinery, New York, NY, USA (2013),
http://dl.acm.org
5. Loose, F.: Paarweise Autorenschaftsverifikation von kurzen Texten. Master thesis,</p>
        <p>Bauhaus-Universität Weimar, Fakultät Medien, Medieninformatik (May 2011)
6. Sapkota, U., Solorio, T.: Sub-Profiling by Linguistic Dimensions to Solve the Authorship
Attribution Task. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) CLEF (Online
Working Notes/Labs/Workshop) (2012),
http://dblp.uni-trier.de/db/conf/clef/clef2012w.html#SapkotaS12
7. SpringerLink: Springer-Verlag GmbH, Heidelberg, Germany (2013),</p>
        <p>http://www.springerlink.com
8. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci.</p>
        <p>
          Technol. 60(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ), 538–556 (Mar 2009), http://dx.doi.org/10.1002/asi.v60:3
9. Tax, D.M.J.: One-Class Classification. Concept Learning In the Absence of
Counter-Examples. Ph.D. thesis, Delft University of Technology (2001),
http://www-ict.ewi.tudelft.nl/˜davidt/thesis.pdf
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>CiteSeer: The College of Information Sciences and Technology</source>
          , The Pennsylvania State University, USA (
          <year>2013</year>
          ), http://citeseerx.ist.psu.edu/index
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Halvani</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Autorschaftsanalyse im Kontext der Attribution, Verifikation und intrinsischer Exploration</article-title>
          .
          <source>Master thesis</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
          </string-name>
          , J.:
          <article-title>Authorship Verification as a One-Class Classification Problem</article-title>
          .
          <source>In: Proceedings of the twenty-first international conference on Machine learning</source>
          . pp.
          <fpage>62</fpage>
          -.
          <source>ICML '04</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2004</year>
          ), http://doi.acm.
          <source>org/10</source>
          .1145/1015330.1015448
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>