<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Generic Authorship Verification Scheme Based on Equal Error Rates</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT</institution>
          ,
          <addr-line>Rheinstrasse 75, 64295 Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>6</volume>
      <issue>0</issue>
      <abstract>
        <p>We present a generic authorship verification scheme for the PAN-2015 identification task. Our scheme uses a two-step training phase on the training corpora. The first phase learns individual feature category parameters as well as decision thresholds based on equal error rates. The second phase builds feature category ensembles, which are used for majority-vote decisions, because ensembles can outperform single feature categories. All feature categories used in our method are deliberately simple, which yields multiple advantages: our method is entirely independent of external linguistic resources (even word lists) and hence can easily be applied to many languages. Moreover, the classification is very fast due to the simple features. Additionally, we make use of parallelization. The evaluation of our scheme on a 40% split (which we did not use for training) of the official PAN-2015 training corpus led to an average corpus accuracy of 68.12%; in detail, 60% for the Dutch, 67.5% for the English, 60% for the Greek and 85% for the Spanish subcorpus. The overall computation runtime was approximately 27 seconds.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Authorship analysis forms a sub-discipline of digital text forensics (a branch of digital
forensics). Authorship attribution (AA) and authorship verification (AV) can be seen as
the two major disciplines of authorship analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        AA deals with the problem of attributing a given anonymous text to exactly one
author, given a training set of authors for which text samples of undisputed authorship
are available [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. AV, on the other hand, aims to determine whether a given text was
written by a certain author A [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The training set in this case comprises only text
samples from the supposed author A, which can be analyzed for distinctive features that
make up the writing style of A. If this writing style differs significantly from that of the
given text, the authorship is considered false; otherwise, it is considered true. AV can be understood
as a specialized task of AA with an open set of candidate authors. Conversely, AA can
be treated as a sequence of verification problems regarding the given candidate authors
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and should be solvable if the verification problem is solved [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In this work we propose a new intrinsic AV method that offers a number of
benefits. The method provides a universal threshold per language and feature category,
used to accept or reject the alleged authorship of a document. Here, universal refers
to the ability to generalize across different genres and topics of the texts. In addition,
the model generated by the method is highly flexible and can be extended
incrementally in order to handle new languages, genres, topics or features. Moreover, the method
requires neither complex machine learning algorithms nor error-prone natural language
processing techniques nor external linguistic resources. Furthermore, it features a very
low computational complexity compared to other existing approaches (including our
previous PAN approaches).</p>
      <p>Features are the core of any AV method. Without features it is not possible to model
the individual writing styles of authors. Many researchers across different fields (linguistics,
computer science, mathematics, etc.) have attempted to distinguish features in various
ways. As a result, features have been classified into topic features, non-topic features,
stylistic features, lexical features, syntactic features, semantic features, etc. In order to
avoid using these terms, which can be very misleading, we decided to use the generic
term features, which we sub-categorize into so-called feature categories. Regarding
our approach, we make use of nine feature categories, which are listed in Table 1.</p>
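      <p>As a small illustration of how simple such feature categories can be, the following sketch shows one possible way to extract punctuation n-grams (F1) and character n-grams (F2). It is an illustrative assumption of ours, not the exact implementation used in our method:</p>

```python
import string
from collections import Counter

def char_ngrams(text, n):
    """F2: all overlapping sequences of n consecutive characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def punctuation_ngrams(text, n):
    """F1: reduce the document to its punctuation characters only,
    then extract character n-grams over the reduced sequence."""
    reduced = "".join(ch for ch in text if ch in string.punctuation)
    return char_ngrams(reduced, n)

sample = "Well, well - who would have guessed it?!"
print(punctuation_ngrams(sample, 2))
```

      <p>Counting the n-grams (rather than merely collecting them) already prepares the later normalization by relative frequency.</p>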
    </sec>
    <sec id="sec-2">
      <title>Features</title>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>The nine feature categories used in our approach.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Fi</th><th>Name</th><th>Feature description</th><th>Parameter range</th></tr>
          </thead>
          <tbody>
            <tr><td>F1</td><td>Punctuation n-grams</td><td>A sequence of n consecutive punctuation marks (commas, hyphens, etc.) taken from D after reduction to punctuation characters.</td><td>n ∈ {1, 2, …, 10}</td></tr>
            <tr><td>F2</td><td>Character n-grams</td><td>A sequence of n consecutive characters in D.</td><td>n ∈ {1, 2, …, 10}</td></tr>
            <tr><td>F3</td><td>n% frequent tokens</td><td>The n% most frequently occurring tokens in D (for n ≤ 30 these will be mostly function words).</td><td>n ∈ {5, 10, …, 50}</td></tr>
            <tr><td>F4</td><td>Token k-prefixes</td><td>The first k characters of a token.</td><td>k ∈ {1, 2, 3, 4}</td></tr>
            <tr><td>F5</td><td>Token k-suffixes</td><td>The last k characters of a token.</td><td>k ∈ {1, 2, 3, 4}</td></tr>
            <tr><td>F6</td><td>Token k-prefix n-grams</td><td>The first k characters of each token within a token n-gram.</td><td>n ∈ {2, 3, 4}, k ∈ {1, 2, 3, 4}</td></tr>
            <tr><td>F7</td><td>Token k-suffix n-grams</td><td>The last k characters of each token within a token n-gram.</td><td>n ∈ {2, 3, 4}, k ∈ {1, 2, 3, 4}</td></tr>
            <tr><td>F8</td><td>n-prefixes–k-suffixes</td><td>The first n and last k characters of a token.</td><td>n, k ∈ {1, 2, 3, 4}</td></tr>
            <tr><td>F9</td><td>n-suffixes–k-prefixes</td><td>The last n characters of a token and the first k characters of the next token.</td><td>n, k ∈ {1, 2, 3, 4}</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>Feature extraction</title>
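      <p>To make the parameterized token features concrete, the following sketch implements F4, F5 and F8 over a pre-tokenized document. It is illustrative only; in particular, the separator used to combine prefix and suffix in F8 is our assumption:</p>

```python
from collections import Counter

def token_prefixes(tokens, k):
    """F4: the first k characters of each token."""
    return Counter(t[:k] for t in tokens)

def token_suffixes(tokens, k):
    """F5: the last k characters of each token."""
    return Counter(t[-k:] for t in tokens)

def prefix_suffix_pairs(tokens, n, k):
    """F8: the first n and last k characters of each token, combined.
    The '|' separator is a choice of ours for illustration."""
    return Counter(t[:n] + "|" + t[-k:] for t in tokens)

tokens = "the quick brown fox jumps over the lazy dog".split()
print(token_prefixes(tokens, 2))
```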
      <p>
        The extraction of features from a text can be performed on any linguistic layer (e.g., the
character layer, morphological layer, lexical layer, syntactic layer or semantic layer).
Depending on the layer, different tools are required for the extraction process. For
example, extracting features from the lexical layer involves tokenizers, stemmers, or
lemmatizers, while extracting features from the semantic layer requires part-of-speech taggers,
parsers, or thesauri [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        A feature can be explicit, i.e extracted directly from the text, or implicit, i.e.
extracted after the text has been preprocessed. Simple letters, for example, can be seen as
explicit features, as they can be looked up in the text itself. In contrast, more complex
features like part-of-speech tags (adjectives, articles, pronouns, etc.) are implicit, since
they are not annotated in the text and thus can only be extracted after the text has been
preprocessed. In our method we focused only on explicit features, extracted from the
first four mentioned layers. There are a number of reasons for this decision:
1. A previous observation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has shown that implicit features do not necessarily lead to
more promising results than explicit features (such as character n-grams).
2. Implicit features often rely on natural language processing models that are bound
to a single language and thus do not generalize.
3. Implicit features often cause a high computation runtime, since various types of
natural language processing techniques are utilized.
      </p>
      <p>Used corpora: In this section we present the training and test corpora used by our method and describe
their inner structure.</p>
      <p>Training and test corpora: In order to construct and evaluate our AV model, we used the publicly released
PAN-2015 training data set (original file name: pan15-authorship-verification-training-dataset-2015-04-19), denoted by CPan15. This corpus is divided into the four subcorpora
CDutch, CEnglish, CGreek and CSpanish, where each subcorpus consists of 100 verification problems.</p>
      <p>At the time our method was created and trained, the official test corpus had not
been released. Hence, we used 60% of each subcorpus as a training corpus and the remaining
40% as a test corpus.</p>
      <p>We denote the training corpus as CTrain = {CTr-Dutch, CTr-English, CTr-Greek, CTr-Spanish}
and the test corpus as CTest = {CTe-Dutch, CTe-English, CTe-Greek, CTe-Spanish}. (The prefixes Tr and Te denote whether a corpus C ∈ (CTrain ∪ CTest) is a training or a test corpus.) CTrain serves
exclusively to construct the models (parameters, thresholds and ensembles), while CTest
is used to test our AV scheme.</p>
      <p>Corpus inner structure: The inner structure of the corpora used in our method (whether for training or testing) is
described as follows. A corpus C comprises a set of problems, where each
problem consists of a questioned document DA? and a set of known
documents DA. The number of known documents in each problem varies from one to
five. The lengths of the texts in these corpora vary between a few hundred
and a few thousand words. The distribution of true and false authorships in each corpus
is uniform. We denote a true authorship by Y and a false authorship by N.</p>
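      <p>The corpus structure described above can be mirrored in a small data structure. The following sketch is purely illustrative; the field names are our own, not those of the PAN data format:</p>

```python
from dataclasses import dataclass

@dataclass
class Problem:
    """One verification problem: a questioned document and the
    known documents of the supposed author A."""
    known_docs: list      # 1 to 5 documents of undisputed authorship
    questioned_doc: str   # the questioned document D_A?
    true_author: str      # ground truth: "Y" (same author) or "N"

corpus = [
    Problem(["I walked home."], "I strolled home.", "Y"),
    Problem(["Rain again today."], "The market closed higher.", "N"),
]
# uniform Y/N distribution, as required by the training phase
assert sum(p.true_author == "Y" for p in corpus) == len(corpus) // 2
```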
      <p>Our method: In this section we first explain which kind of preprocessing is applied to the training
and test corpora. Next, we introduce both models our method relies on, and finally,
we describe the treatment of verification problems in the test corpora, which involves
(among other things) similarity calculation and classification.</p>
      <p>Preprocessing: Given some corpus C, for each document D within C, we replace newlines, tabs and
runs of multiple spaces by a single space. This enables a better extraction of several features,
for instance character n-grams. Furthermore, we concatenate all known documents
within each problem into one document, such that (from now on) each problem consists only
of the two documents DA and DA?. Beyond this, no other preprocessing techniques are
applied to the data.</p>
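      <p>A minimal sketch of this preprocessing step, assuming a regular-expression based whitespace normalization (the exact implementation is not specified here):</p>

```python
import re

def preprocess(document):
    """Collapse newlines, tabs and runs of spaces into a single space."""
    return re.sub(r"\s+", " ", document).strip()

def merge_known_documents(known_docs):
    """Concatenate all known documents of a problem into one document,
    so that each problem consists of exactly two documents afterwards."""
    return preprocess(" ".join(known_docs))

print(preprocess("two\tlines\n\nhere  "))
```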
      <p>Model training: Our AV scheme depends on two models that are learned on each training corpus C ∈
CTrain. The first model, denoted as MF, includes for each feature category (listed in Table 1) those
parameters that led to the highest accuracy on
C. Moreover, MF includes for each feature category Fi a decision threshold, which is
determined as the threshold of the equal error rate (EER). The second model, denoted as
ME, builds on MF and includes for each corpus C the ensemble of feature categories
F1, F2, …, F9 that led to the highest accuracy.</p>
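      <p>The EER-based threshold determination can be sketched as follows. Scanning the observed similarity scores as candidate thresholds and picking the one where false-acceptance and false-rejection rates are closest is our assumption of one reasonable procedure; the exact computation is not spelled out here:</p>

```python
def determine_threshold_via_eer(y_scores, n_scores):
    """Pick the threshold where the false-acceptance rate (N problems
    accepted) and the false-rejection rate (Y problems rejected) are
    closest. Candidates are the observed similarity scores themselves."""
    best_t, best_gap = None, float("inf")
    for t in sorted(y_scores + n_scores):
        far = sum(s > t for s in n_scores) / len(n_scores)  # false acceptances
        frr = sum(s < t for s in y_scores) / len(y_scores)  # false rejections
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

y = [0.6, 0.7, 0.8, 0.9]   # similarities of true authorships
n = [0.1, 0.2, 0.3, 0.4]   # similarities of false authorships
print(determine_threshold_via_eer(y, n))
```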
      <p>Constructing MF: In order to construct the feature category model MF, we use
Algorithms 1–6. Note that some auxiliary methods, for instance
Length() or LanguageOfCorpus(C), are not listed, as the semantics behind these methods are obvious.</p>
      <p>Constructing ME: In order to construct the ensemble model ME, we use
Algorithms 7 and 8. Note that the uniform Y/N distribution is important for the training phase of our method.</p>
      <p>Algorithm 1 (Construct the MF model): For each language (represented through its training corpus C), MF stores those
(n, k) parameters and that decision threshold which lead to the highest accuracy on C. The algorithm iterates over a
dictionary of candidate thresholds and, for each language, selects the tuple (n, k, threshold, CAccuracy) whose
corpus accuracy is the maximum among all tuples, and adds it to MF.</p>
      <p>Algorithm 2 (DetermineThreshold(C, F, n, k)): Computes the similarities of all problems in C via Algorithm 3, splits them
into an ascending list Y of similarities of true authorships and an ascending list N of similarities of false
authorships, and determines the decision threshold via Algorithm 5.</p>
      <p>Algorithm 3 (ComputeProblemsSimilarities(C, F, n, k)): For each problem in C, computes the similarity between the questioned
document DA? and the known document DA via Algorithm 4 and stores it, together with the ground truth, in a dictionary.</p>
      <p>Algorithm 4 (ComputeProblemSimilarity(DA?, DA, F, n, k)): Extracts the features of DA and DA?, normalizes them by relative
frequency, restricts both feature dictionaries to the shared vocabulary (the union of their keys), computes the
Manhattan distance dist between the resulting frequency vectors and transforms it into a similarity via the linear
transformation sim = 1 / (1 + dist).</p>
      <p>Algorithm 5 (DetermineThresholdViaEER(Y, N)): Determines the decision threshold at the point where the false acceptance
and false rejection rates over the sorted similarity lists Y and N are equal.</p>
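      <p>The similarity computation at the heart of Algorithm 4 can be sketched as follows: relative-frequency normalization, Manhattan distance over the union vocabulary, and the linear transformation sim = 1/(1 + dist). Details such as the dictionary handling are our assumptions:</p>

```python
from collections import Counter

def relative_frequencies(features):
    """Normalize raw feature counts to relative frequencies."""
    total = sum(features.values())
    return {f: c / total for f, c in features.items()}

def problem_similarity(known_features, questioned_features):
    """Similarity between the known and the questioned document:
    Manhattan distance over relative feature frequencies (missing
    features count as 0), then the transformation sim = 1/(1+dist)."""
    a = relative_frequencies(known_features)
    b = relative_frequencies(questioned_features)
    vocabulary = set(a) | set(b)
    dist = sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in vocabulary)
    return 1.0 / (1.0 + dist)

fa = Counter({"th": 2, "he": 2})
fb = Counter({"th": 1, "he": 1})
print(problem_similarity(fa, fb))   # identical distributions -> 1.0
```

      <p>Since dist is non-negative, sim always lies in (0, 1], with 1 for identical feature distributions.</p>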
      <sec id="sec-4-14">
        <title>Model testing and evaluation</title>
        <p>Algorithm 6 (CalculateCorpusAccuracy(C, F, n, k)): Computes the problem similarities on C via Algorithm 3 and classifies
each problem: if the similarity lies above the threshold of F, the predicted authorship is "Y"; if it lies below, the
prediction is "N"; otherwise it remains "Undefined". The corpus accuracy is the percentage of correct predictions,
i.e., CAccuracy = 100 · truePositives / |C|.</p>
        <p>Now that we have constructed the models, we also need to test them. This is done in
Algorithm 9, which applies the AV scheme to the test corpora CTest = {CTe-Dutch, CTe-English, CTe-Greek, CTe-Spanish}.</p>
        <p>In this section we describe the evaluation results. First, we present the models MF and
ME learned from the training corpora CTrain = {CTr-Dutch, CTr-English, CTr-Greek, CTr-Spanish}.
Afterwards, we show (given both pre-trained models) the evaluation results on the test
corpora. The training on CTrain led to the model MF = {MF1, MF2, …, MF9}, which is given
in Tables 2–10. Given MF, CTrain and F as input, we applied ConstructME(MF, CTrain, F) on all
training subcorpora. The resulting model ME is given in Table 11.</p>
        <p>Algorithm 7 (ConstructME(MF, CTrain, F)): Considers only odd-sized ensembles of feature categories (due to majority
voting), taken from the power set of F. For each training corpus C, it creates the ensemble predictions via
CreateFeatureCategoryEnsembleResults, evaluates them against the truth file, and adds to ME the ensemble that led to
the highest corpus accuracy on C.</p>
      </sec>
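      <p>The majority vote over an odd-sized feature category ensemble can be sketched as follows. The tie case sim == threshold, which is left "Undefined" in Algorithm 6, is simplified here to a rejection:</p>

```python
def category_vote(similarity, threshold):
    """A single feature category accepts the authorship iff the
    similarity exceeds its EER-based threshold (ties reject here,
    a simplification of the 'Undefined' case)."""
    return "Y" if similarity > threshold else "N"

def ensemble_decision(votes):
    """Majority vote over the per-feature-category decisions of an
    odd-sized ensemble ('Y' = same author, 'N' = different author)."""
    assert len(votes) % 2 == 1, "ensembles must be odd-sized"
    return "Y" if votes.count("Y") > len(votes) // 2 else "N"

# three categories with (similarity, threshold) pairs
votes = [category_vote(s, t) for s, t in [(0.8, 0.5), (0.3, 0.5), (0.6, 0.4)]]
print(ensemble_decision(votes))   # two of three categories vote 'Y'
```

      <p>Restricting the ensembles to odd sizes guarantees that the vote can never be tied.</p>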
      <sec id="sec-4-15">
        <title>Applying the scheme</title>
        <p>Algorithm 8 (ConstructFpairings(MF, C, F)): For each feature category F and each problem in C, loads the learned
parameters n and k, the model and the decision threshold from MF, computes the problem similarity via Algorithm 4, and
stores the pairing of threshold and similarity per problem.</p>
      </sec>
      <sec id="sec-4-17">
        <title>Results and conclusion</title>
        <p>Algorithm 9 (applying the AV scheme, based on the pre-trained models, to the test corpora CTest): For each test corpus
C in CTest, constructs the pairings via Algorithm 8, selects from ME the most promising ensemble with respect to the
language of C, creates the ensemble predictions, evaluates them against the truth file and prints the predictions
together with the corpus accuracy.</p>
        <p>The evaluation of our approach (based on the pre-trained models MF and ME) on the
test corpus CTest led to the results listed in Table 12. The runtime needed for the entire
evaluation was approximately 27 seconds.</p>
        <p>We presented a generic and fast authorship verification scheme for the Author
Identification task within the PAN-2015 competition. Our method provides a number of
benefits. The most important benefit (in comparison to our previous PAN approaches) is that
our method finally makes use of trainable models. These models not only contribute to
the low runtime, since thresholds do not need to be computed over and over again, but
can also be extended incrementally and so enable a better generalization in terms of
languages, genres or topics. Another benefit that comes directly from the transparent
models is that they make the approach more interpretable, as one is able to inspect the
effects of the different feature category parameters as well as the decision thresholds. In
future work we will focus on this aspect. A further benefit of our method is that there is
no need for deep linguistic processing (e.g., POS tagging, chunking or parsing) or
linguistic resources such as word lists, wordnets, thesauri, etc. This not only results in a
low runtime but also makes the method portable. Besides this, the method can easily be
modified with regard to its underlying components. For example, the distance function can
be replaced by an ensemble of distance functions with almost no effort.</p>
        <p>However, there are also open issues in our approach that leave room for
improvement. The most important issue is that, at the moment, our method is difficult
to comprehend, which makes it complicated to understand and implement by others.
Therefore, we plan in future work to refactor the algorithms in order to make the
scheme more compact.</p>
        <p>Acknowledgments. This work was supported by the Center for Advanced Security
Research Darmstadt CASED (www.CASED.de), funded by the German state government
of Hesse under the LOEWE program and by the German Federal Ministry of Education
and Research (BMBF) in the funded projects ALPhA and EWV.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D., Ferro, N. (eds.) Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013. CEUR Workshop Proceedings, vol. 1179. CEUR-WS.org (2013), http://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-JuolaEt2013.pdf</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Koppel, M., Schler, J., Argamon, S., Winter, Y.: The "Fundamental Problem" of Authorship Attribution. English Studies 93(3), 284-291 (2012), http://dx.doi.org/10.1080/0013838X.2012.668794</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Koppel, M., Winter, Y.: Determining if Two Documents are by the Same Author. JASIST 65(1), 178-187 (2014)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538-556 (Mar 2009)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>