<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VEBAV - A Simple, Scalable and Fast Authorship Verification Scheme</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oren Halvani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Steinebach</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT</institution>
          ,
          <addr-line>Rheinstrasse 75, 64295 Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>1049</fpage>
      <lpage>1062</lpage>
      <abstract>
        <p>We present VEBAV, a simple, scalable and fast authorship verification scheme for the Author Identification (AI) task within the PAN-2014 competition. VEBAV (VEctor-Based Authorship Verifier), a modification of our existing PAN-2013 approach, is an intrinsic one-class verification method based on a simple distance function. VEBAV offers a number of benefits, for instance its independence of linguistic resources and tools such as ontologies, thesauruses, language models, dictionaries and spellcheckers. Another benefit is the low runtime of the method, since deep linguistic processing techniques such as POS tagging, chunking or parsing are not used. A further benefit of VEBAV is its ability to handle more than one language: it can be applied to documents written in Indo-European languages such as Dutch, English, Greek or Spanish. VEBAV can also be extended or modified easily by replacing its underlying components, for instance the distance function (required for classification), the acceptance criterion and the underlying features including their parameters. In our experiments on a 20% split of the PAN 2014 AI training corpus we achieved an overall accuracy of 65.83% (in detail: 80% for Dutch-Essays, 55% for Dutch-Reviews, 55% for English-Essays, 80% for English-Novels, 70% for Greek-Articles and 55% for Spanish-Articles).</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship verification</kwd>
        <kwd>one-class-classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Authorship Verification (AV) is a sub-discipline of Authorship Analysis [4, p. 3],
which itself is an integral part of digital text forensics. It can be applied in many forensic
scenarios, for instance to check the authenticity of written contracts, threats, insults or
testaments. The goal of AV always remains the same: verify whether two
documents $D_A$ and $D_{A?}$ were written by the same author $A$. An alternative
formulation of the goal is to verify the authorship of $D_{A?}$, given a set of sample documents of
$A$. From a machine learning perspective, AV is a one-class classification
problem [3], because $A$ is the only target class to be distinguished from all
other possible classes (authors), whose number is theoretically unlimited.</p>
      <p>Corresponding author: Oren Halvani.</p>
      <p>To perform AV, at least four components are required:
– The document $D_{A?}$, whose alleged authorship is to be verified.
– A training set $\mathcal{D}_A = \{D_{1A}, D_{2A}, \ldots\}$, where each $D_{iA}$ is a sample
document of $A$.
– A set of features $F = \{f_1, f_2, \ldots\}$, where each $f_j$ (style marker) should help to
model the writing style of $D_{A?}$ and of each $D_{iA} \in \mathcal{D}_A$.
– At least one classification method, which accepts or rejects the alleged authorship
based on $F$ and a predefined or dynamically determined threshold $\theta$.</p>
      <p>The aim of this paper is to provide a simple, scalable and fast AV scheme that
offers a number of benefits, for instance promising detection rates, easy implementation,
low runtime, independence of language and linguistic resources, and easy
modifiability and extensibility. Our proposed AV scheme, denoted VEBAV, is based on our
earlier approach to the PAN 2013 AI task, which was itself a modification of
the Nearest Neighbor (NN) one-class classification technique described by Tax [5,
p. 69]. In a nutshell, VEBAV takes as input a set of sample documents of a known
author ($\mathcal{D}_A$) and one document of an unknown author ($D_{A?}$). All documents in $\mathcal{D}_A$ are
first concatenated into one large document, which is then split into (near) equal-sized
chunks, such that $\mathcal{D}_A$ now contains only these chunks. Afterwards, feature vectors are
constructed from each $D_{iA} \in \mathcal{D}_A$ and from $D_{A?}$ according to preselected feature sets.
Next, a representative is selected among the training feature vectors (based on one of two
options); this representative is needed to determine the decision regarding the alleged authorship.
In the next step, distances are calculated between the representative and the remaining
training feature vectors, as well as between the representative and the test feature vector.
Based on all calculated distances, a twofold decision regarding the alleged
authorship is generated, which consists of a ternary decision $\delta \in \{\text{Yes}, \text{No}, \text{Unanswered}\}$ and a
probability score $\rho$ that describes the soundness of $\delta$.</p>
      <p>The rest of this paper is structured as follows. In the following section we explain
how we extract features from the corresponding linguistic layers and which features
we used in our approach. In Section 3 we describe our verification
scheme. In Section 4 we present the corpus that was used to evaluate our scheme, and
in Section 5 we show our evaluation results. Finally, in Section 6 we draw our conclusions, discuss the
challenges we faced and provide some ideas for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Features</title>
      <p>Features (style markers) are the core of AV, since they approximate writing
styles and can thus help to judge automatically whether two texts have a very similar style. If
they do, this is an indicator that both texts could have been written by the same author. In the
next subsections we explain from which sources quantifiable features can be retrieved,
which tools are required for the extraction process and which feature
sets we used in our approach.</p>
      <p>2.1 Linguistic layers
From a linguistic point of view, features are extracted from so-called linguistic layers,
which can be understood as abstract units within a text. The most important
linguistic layers are the following:
– Phoneme layer: This layer includes phoneme-based features, for example
vowels, consonants or the more sophisticated supra-segmental or prosodic features.
Such features can typically be obtained from texts by using pronouncing dictionaries
(e.g. IPA).
– Character layer: This layer includes character-based features, for instance
prefixes, suffixes or letter n-grams, which are typically extracted from texts via regular
expressions.
– Lexical layer: This layer includes token-based features, for instance function
words or POS tags (part-of-speech tags). These features can be extracted from
texts via tokenizers (which are often based on simple regular expressions).
– Syntactic layer: This layer includes syntax-based features, for instance
constituents (e.g. nominal phrases) or collocations. Such features can be extracted by
sophisticated regular expressions or by natural language processing tools (e.g. a
POS tagger). However, the latter are normally bound to a specific language model
and thus cannot scale to multiple languages. In addition, the runtime of natural
language processing tools is much higher than the runtime of pattern
matching via regular expressions.
– Semantic layer: This layer includes semantics-based features, e.g. semantic
relations (hyponyms, synonyms, meronyms, etc.). Such features require deep
linguistic processing, which often relies on external knowledge resources (e.g.
WordNet) or complex tools (e.g. parsers, named entity recognizers, etc.).</p>
      <p>In VEBAV we only make use of the character, lexical and syntactic
layers, because their underlying features are effective and because extracting them
via regular expressions keeps the runtime low.</p>
      <p>2.2 Feature sets
In this paper we use the term feature set, denoted by $F$, for a collection of features
belonging to one or more linguistic layers. Table 1 lists the 14 feature sets that were
used in our experiments. Note that $F_{12}$, $F_{13}$ and $F_{14}$ are
mixtures of existing feature sets. The idea behind them was to see whether such mixtures can
outperform single feature sets when it comes to classification. This was in fact the case,
as can be observed in the later experiments (Section 5.3).</p>
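      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>The 14 feature sets used in our experiments.</p>
        </caption>
        <table>
          <thead>
            <tr><th>$F_i$</th><th>Feature set</th><th>Description</th><th>Examples</th></tr>
          </thead>
          <tbody>
            <tr><td>$F_1$</td><td>Characters</td><td>All kinds of characters</td><td>{a, b, 1, 8, #, &lt;, %, !, ...}</td></tr>
            <tr><td>$F_2$</td><td>Letters</td><td>All kinds of letters</td><td>{a, b, ä, ß, ó, á, ñ, ...}</td></tr>
            <tr><td>$F_3$</td><td>Punctuation marks</td><td>Symbols that structure sentences</td><td>{., :, ;, -, ", (, ), ...}</td></tr>
            <tr><td>$F_4$</td><td>Word k-prefixes</td><td>The k starting letters of words</td><td>example → {e, ex, exa, exam, ...}</td></tr>
            <tr><td>$F_5$</td><td>Word k-suffixes</td><td>The k ending letters of words</td><td>example → {e, le, ple, mple, ...}</td></tr>
            <tr><td>$F_6$</td><td>Character n-grams</td><td>Overlapping character fragments</td><td>ex-ample → {ex-, x-a, -am, ...}</td></tr>
            <tr><td>$F_7$</td><td>Letter n-grams</td><td>Overlapping letter fragments</td><td>ex-ample → {exa, xam, amp, ...}</td></tr>
            <tr><td>$F_8$</td><td>Tokens</td><td>Segmented character-based units</td><td>A [sample] text! → {A, [sample], text!}</td></tr>
            <tr><td>$F_9$</td><td>Words</td><td>Segmented letter-based units</td><td>A [sample] text! → {A, sample, text}</td></tr>
            <tr><td>$F_{10}$</td><td>Token n-grams</td><td>Overlapping token fragments</td><td>Wind and rain! → {Wind and, and rain!}</td></tr>
            <tr><td>$F_{11}$</td><td>Word n-grams</td><td>Overlapping word fragments</td><td>Wind and rain! → {Wind and, and rain}</td></tr>
            <tr><td>$F_{12}$</td><td>Mix1</td><td>A mix of three feature sets</td><td>$F_1 \cup F_3 \cup F_6$</td></tr>
            <tr><td>$F_{13}$</td><td>Mix2</td><td>A mix of three feature sets</td><td>$F_1 \cup F_2 \cup F_5$</td></tr>
            <tr><td>$F_{14}$</td><td>Mix3</td><td>A mix of four feature sets</td><td>$F_3 \cup F_4 \cup F_5 \cup F_8$</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>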
As can be seen in Table 1, six feature sets can be parameterized by $n$ (n-gram size)
and/or $k$ (length of prefixes/suffixes). It should be emphasized that adjusting these two
parameters can influence the results massively. We therefore fixed both parameters to
constant values ($n = k = 3$), learned from earlier experiments, which also turned out
to be reliable in the current experiments across the diverse PAN subcorpora.</p>
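      <p>To illustrate how regular-expression-based extraction of some of these feature sets might look with n = k = 3, the following Python sketch extracts character n-grams, word prefixes and word suffixes. The function names, the word pattern and the choice to skip words shorter than k letters are illustrative assumptions and are not taken from VEBAV itself.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: a "word" is a maximal run of Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def char_ngrams(text, n=3):
    """Overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def words(text):
    """Letter-based units of a text."""
    return WORD_RE.findall(text)


def prefixes(text, k=3):
    """The k starting letters of every word that has at least k letters."""
    return [w[:k] for w in words(text) if len(w) >= k]


def suffixes(text, k=3):
    """The k ending letters of every word that has at least k letters."""
    return [w[-k:] for w in words(text) if len(w) >= k]


print(char_ngrams("ex-ample")[:3])  # ['ex-', 'x-a', '-am']
print(prefixes("An example"))       # ['exa']
print(suffixes("An example"))       # ['ple']
```
]]></preformat>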
    </sec>
    <sec id="sec-3">
      <title>3 Proposed Verification Scheme</title>
      <p>In this section we give a detailed description of our AV scheme VEBAV. For a better
overview we divide the entire workflow of the algorithm into six subsections. We
first explain what kind of preprocessing we perform on the data; the other five
subsections focus on the algorithm itself.</p>
      <p>3.1 Preprocessing
In contrast to our PAN-2013 approach, we decided in our current approach to apply neither
normalization nor noise reduction techniques. Instead, we treat each text as it
is, which turned out to be not only less burdensome but also promising. Our only
preprocessing step is to concatenate all $D_{iA} \in \mathcal{D}_A$ into a single document
$D_A$ which, depending on its length, is split again into $\ell$ (near) equal-sized chunks
$D_{1A}, D_{2A}, \ldots, D_{\ell A}$. In our experiments $\ell$ is set to five chunks if, and
only if, the length of $D_A$ is above 15,000 characters; otherwise $\ell$ is set to three
chunks. The reason for this decision is that we observed considerably better results than
with a single fixed $\ell$ for both short and long texts.</p>
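      <p>The preprocessing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the documents are concatenated without a separator and the chunks are cut on character boundaries, neither of which is specified in the text. The function name and the threshold parameter are hypothetical.</p>
      <preformat><![CDATA[
```python
def split_into_chunks(documents, threshold=15_000):
    """Concatenate the known-author documents and split the result into
    three or five (near) equal-sized chunks, depending on its length."""
    text = "".join(documents)  # assumption: plain concatenation
    num_chunks = 5 if len(text) > threshold else 3
    base, extra = divmod(len(text), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks get one additional character.
        end = start + base + (1 if i < extra else 0)
        chunks.append(text[start:end])
        start = end
    return chunks


chunks = split_into_chunks(["a" * 10, "b" * 7])
print([len(c) for c in chunks])  # [6, 6, 5]
```
]]></preformat>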
      <p>3.2 Vocabulary Generation
To form a basis for the construction of feature vectors, we need to build a
global feature vocabulary, denoted by $V$. Beforehand, we need to select at least
one (appropriate) feature set $F$, to know which sort of features should be taken into
account. Here, appropriate refers to features that...</p>
      <p>– ... are able to model high similarity between $D_{A?}$ and $\mathcal{D}_A$ (for the case $A? = A$).
– ... are able to discriminate well between the writing style of $A$ and all other possible
authors (for the case $A? \neq A$).
– ... appear in all generated vocabularies.</p>
      <p>Afterwards, we construct from $D_{A?}$ and each chunk $D_{iA} \in \mathcal{D}_A$ the
corresponding feature vocabularies $V_{D_{A?}}$ and $V_{D_{1A}}, V_{D_{2A}}, \ldots, V_{D_{\ell A}}$. As a last step we
intersect all constructed vocabularies to build the global vocabulary $V$:</p>
      <p><disp-formula><tex-math>V = V_{D_{A?}} \cap V_{D_{1A}} \cap V_{D_{2A}} \cap \ldots \cap V_{D_{\ell A}}</tex-math></disp-formula></p>
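      <p>The intersection can be sketched in a few lines of Python. The helper names are hypothetical, and the single-character extractor merely stands in for whichever feature set is selected.</p>
      <preformat><![CDATA[
```python
def characters(text):
    """Stand-in feature extractor: every character is a feature."""
    return list(text)


def global_vocabulary(extract, unknown_doc, chunks):
    """Intersect the vocabulary of the unknown document with the
    vocabularies of all training chunks (case-sensitive)."""
    vocab = set(extract(unknown_doc))
    for chunk in chunks:
        vocab &= set(extract(chunk))
    return sorted(vocab)  # sorted only to obtain a stable feature order


# 'A' and 'c' are dropped because they do not occur in every text.
print(global_vocabulary(characters, "abcA", ["abA", "bca"]))  # ['a', 'b']
```
]]></preformat>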
      <p>Note that VEBAV treats features case-sensitively, such that $\text{word} \in V$ and
$\text{Word} \in V$ are both valid.</p>
      <p>
        3.3 Constructing feature vectors
Once $V$ is generated, the next step is to construct the feature vectors $F_{1A}, F_{2A}, \ldots, F_{\ell A}$
from each $D_{iA} \in \mathcal{D}_A$ and $F_{A?}$ from $D_{A?}$. The construction of these
vectors is straightforward. For each $f_i \in V$ we look up how often it occurs in a
document $D$, which gives its absolute frequency. Next, we normalize this frequency by the
length of $D$ to obtain its relative frequency, which is a number $x_i \in [0, 1]$.
Hence, the feature vector representation of $D$ is formally defined by:
<disp-formula><tex-math>F = (x_1, x_2, \ldots, x_n), \quad \text{with } n = |V|</tex-math></disp-formula>
      </p>
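      <p>A minimal Python sketch of this construction is given below. It assumes that the length of a document is measured in characters, which the text does not state explicitly; the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def feature_vector(features, vocabulary, doc_length):
    """Relative frequency of every vocabulary entry in one document.

    `features` are all feature occurrences extracted from the document,
    `doc_length` is the document length (assumed: number of characters).
    """
    counts = Counter(features)
    return [counts[f] / doc_length for f in vocabulary]


doc = "abab"
print(feature_vector(list(doc), ["a", "b"], len(doc)))  # [0.5, 0.5]
```
]]></preformat>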
      <p>3.4 Representative Selection
After constructing all feature vectors, a representative (training) feature vector $F_{rep}$
must be selected. This step is essential for the later determination of the decision
regarding the alleged authorship. VEBAV offers two options to select $F_{rep}$:
1. Selecting $F_{rep}$ dynamically by using a similarity function (e.g. cosine similarity)
between all training feature vectors. Here, $F_{rep}$ is the feature
vector that is most dissimilar from the others. In other words, $F_{rep}$ can be
understood as an outlier.
2. Selecting $F_{rep}$ manually (static).</p>
      <p>Note that in our experiments we chose the first option.</p>
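      <p>The dynamic option could be sketched as follows. How the pairwise similarities are aggregated is not specified in the text; this sketch assumes that the vector with the lowest mean cosine similarity to all other training vectors is chosen. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def select_representative(vectors):
    """Index of the training vector that is least similar to the others.

    Assumption: "most dissimilar" means lowest mean cosine similarity.
    Requires at least two vectors.
    """
    def mean_similarity(i):
        sims = [cosine_similarity(vectors[i], v)
                for j, v in enumerate(vectors) if j != i]
        return sum(sims) / len(sims)

    return min(range(len(vectors)), key=mean_similarity)


training = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(select_representative(training))  # 2
```
]]></preformat>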
      <p>3.5 Distance Calculations
In this step we calculate the distances needed for determining the decision
regarding the alleged authorship. Concretely, we calculate the distances $d_1, d_2, \ldots, d_\ell$ between
$F_{rep}$ and each of $F_{1A}, F_{2A}, \ldots, F_{\ell A}$, as well as the distance $d_?$ between $F_{rep}$ and $F_{A?}$.
The calculation of these distances requires a predefined distance function $dist(X, Y)$
with $X = F_{rep}$ and $Y \in \{F_{1A}, F_{2A}, \ldots, F_{\ell A}\}$. We implemented a broad range of
distance functions in VEBAV, the majority of which were taken from [1]. However, for
complexity reasons we used only the Minkowski distance function in our experiments,
which is formally defined as:
<disp-formula><tex-math>dist(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{\lambda} \right)^{\frac{1}{\lambda}}, \quad \text{with } \lambda \in \mathbb{R}^{+} \setminus \{0\}</tex-math></disp-formula></p>
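      <p>For illustration, the Minkowski distance can be written in Python as below. The function name and the default value of the parameter (0.6, the value selected later in Experiment I) are choices made for this sketch only.</p>
      <preformat><![CDATA[
```python
def minkowski(x, y, lam=0.6):
    """Minkowski distance between two equal-length vectors for lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be a positive real number")
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)


print(minkowski([0.0, 0.0], [3.0, 4.0], lam=1))  # 7.0 (Manhattan distance)
print(minkowski([0.0, 0.0], [3.0, 4.0], lam=2))  # 5.0 (Euclidean distance)
```
]]></preformat>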
      <p>As a last step we calculate the average of the distances between $F_{rep}$ and
each of $F_{1A}, F_{2A}, \ldots, F_{\ell A}$ as follows:
<disp-formula><tex-math>d_\emptyset = \frac{1}{\ell} (d_1 + d_2 + \ldots + d_\ell)</tex-math></disp-formula></p>
      <p>3.6 Decision Determination
The goal of this step is to construct a twofold decision regarding the alleged authorship,
which consists of a ternary decision $\delta \in \{\text{Yes}, \text{No}, \text{Unanswered}\}$ and a probability score
$\rho$. To calculate these terms, both $d_?$ and $d_\emptyset$ are required. For the latter
we use the following adjusted form: $d_\emptyset' = d_\emptyset + (\omega \tau)$, where $\omega$ denotes a weight
and $\tau$ a tolerance parameter calculated from the standard deviation of $d_1, d_2, \ldots, d_\ell$.
The idea behind $\omega$ and $\tau$ is to cope with noisy writing styles that might
result from mixing different sample documents of $A$ together in the training set
$\mathcal{D}_A$. In our experiments we chose a neutral weight ($\omega = 1$), since we could
not determine an optimal value for it (even adjusting it by 0.1 can cause a massive drop
in the classification results). With these we first define the probability score $\rho$.
[The defining equation of $\rho$ is missing from this version of the text.]
Next, we define the ternary decision $\delta$ as follows:
<disp-formula><tex-math>\delta = \begin{cases} \text{Yes}, &amp; d_? &lt; d_\emptyset' \\ \text{No}, &amp; d_? &gt; d_\emptyset' \\ \text{Unanswered}, &amp; (d_? = d_\emptyset') \lor (|d_? - d_\emptyset'| &lt; \varepsilon) \end{cases}</tex-math></disp-formula></p>
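      <p>The ternary decision can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements from the text: the tolerance is taken to be the population standard deviation of the training distances, the acceptance condition is taken to be a test distance below the adjusted average, and the probability score is left out. The function name is hypothetical.</p>
      <preformat><![CDATA[
```python
import statistics


def decide(train_distances, test_distance, weight=1.0, epsilon=0.001):
    """Return 'Yes', 'No' or 'Unanswered' for one verification problem."""
    d_avg = sum(train_distances) / len(train_distances)
    # Assumption: tolerance = population standard deviation of d_1..d_l.
    tolerance = statistics.pstdev(train_distances)
    threshold = d_avg + weight * tolerance  # adjusted average distance
    if abs(test_distance - threshold) < epsilon:
        return "Unanswered"  # equal or near-equal distances
    return "Yes" if test_distance < threshold else "No"


train = [0.0, 0.2, 0.4]
print(decide(train, 0.1))  # Yes
print(decide(train, 0.9))  # No
```
]]></preformat>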
      <p>The concrete semantics of $\delta$ are:
– Yes: VEBAV accepts the alleged author as the true author ($A? = A$).
– No: VEBAV rejects the alleged author as the true author ($A? \neq A$).
– Unanswered: VEBAV is unable to generate a prediction because $d_?$ and $d_\emptyset'$ are
equal or near-equal, or due to another unexpected result. In any such case $\rho$ is set to 0.5.
Note that, depending on how $\varepsilon$ is chosen (for the case that $d_?$ and $d_\emptyset'$ are
near-equal), the number of Unanswered decisions in a corpus
evaluation can vary considerably. In our experiments we chose $\varepsilon = 0.001$, as this small
value restricts the number of Unanswered decisions, for which $\rho$ is near 0.5.</p>
      <p>4 Corpus
For our experiments we used the official PAN-2014 Author Identification
training corpus, denoted by $C_{\text{PAN-14}}$, which was released by the PAN organizers on
April 22, 2014 (the corpus can be downloaded from http://pan.webis.de; accessed on June 26,
2014). $C_{\text{PAN-14}}$ consists of 695 problems (2,382 documents in total), equally
distributed regarding true/false authorships. A problem $p_i$ is a tuple $(\mathcal{D}_{A_i}, D_{A_i?})$,
where $\mathcal{D}_{A_i}$ denotes the training set of $A_i$ and $D_{A_i?}$ the questioned document, which
may or may not have been written by $A_i$. Each problem belongs to one of four languages (Dutch, English,
Greek and Spanish) and to one of four genres (essays, reviews, novels and articles).
For simplicity, $C_{\text{PAN-14}}$ is divided into six subcorpora and can thus be
written as $C_{\text{PAN-14}} = \{C_{DE}, C_{DR}, C_{EE}, C_{EN}, C_{GR}, C_{SP}\}$. This makes it easier to treat each
subcorpus independently (e.g. in terms of parameterization). The full name of each
$C \in C_{\text{PAN-14}}$ is given in Table 2.</p>
      <p>For our experiments we used an 80% portion of $C_{\text{PAN-14}}$ for training and parameter
learning (denoted by $C_{\text{PAN-Train}}$), while the remaining 20% portion was used for testing
(denoted by $C_{\text{PAN-Test}}$). $C_{\text{PAN-Test}}$ has the same structure as $C_{\text{PAN-14}}$, such that
$C_{\text{PAN-Test}} = \{C_{DE}, C_{DR}, C_{EE}, C_{EN}, C_{GR}, C_{SP}\}$ holds, where each
$C \in C_{\text{PAN-Test}}$ contains 20% of the problems of the corresponding subcorpus. The concrete statistics,
including the distribution of true/false authorships, are given in Table 3.</p>
      <p>5 Evaluation
In this section we carry out our evaluation on the $C_{\text{PAN-14}} = C_{\text{PAN-Train}} \cup C_{\text{PAN-Test}}$
corpus. First we explain which performance measures were used and, second, how
the most important parameters were learned from $C_{\text{PAN-Train}}$. Finally, we evaluate our
approach on $C_{\text{PAN-Test}}$.</p>
      <p>5.1 Performance Measures
To evaluate our approach, we used several performance measures, which share the
following variables:
– $m$: number of problems in $C$
– $m_c$: number of correct answers per $C$
– $m_u$: number of unanswered problems per $C$</p>
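      <p>The first performance measure is $\text{Accuracy} = \frac{m_c}{m}$, whereas for the second
performance measure we first need to define two terms:
<disp-formula><tex-math>\text{AUC} = \frac{1}{m} \sum_{i=1}^{m} \rho_i, \qquad \text{c@1} = \frac{1}{m} \left( m_c + \frac{m_u}{m} \cdot m_c \right)</tex-math></disp-formula></p>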
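      <p>As a small worked example, accuracy and c@1 can be computed in Python as shown below. The function names and the example counts are hypothetical and are not results from our experiments.</p>
      <preformat><![CDATA[
```python
def accuracy(m, m_c):
    """Fraction of correctly answered problems."""
    return m_c / m


def c_at_1(m, m_c, m_u):
    """c@1: unanswered problems are credited with the accuracy
    achieved on the whole set of problems."""
    return (m_c + m_u * m_c / m) / m


# Hypothetical counts: 10 problems, 6 correct, 2 left unanswered.
print(accuracy(10, 6))             # 0.6
print(round(c_at_1(10, 6, 2), 2))  # 0.72
```
]]></preformat>
      <p>Without unanswered problems ($m_u = 0$), c@1 is identical to the accuracy.</p>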
      <p>Here, $\rho_i$ denotes the probability score of the corresponding problem $p_i$, which
was defined in Section 3. With these two terms we define the second performance
measure: $\text{AUC} \cdot \text{c@1}$. Note that for parameter learning on the $C_{\text{PAN-Train}}$ corpus we
used only the Accuracy measure, as (in our opinion) this measure is easier to interpret.</p>
      <p>5.2 Experiment I: Finding an optimal $\lambda$ for the Minkowski distance function
The intention behind this experiment was to find an optimal $\lambda$ parameter for the
Minkowski distance function. Since $\lambda$ has a strong influence on the classification result, it
must be chosen well in order to generalize across all involved corpora and
all feature sets. To achieve this generalization we merged all training subcorpora, such
that $C_{\text{PAN-Train}} = C_{DE} \cup C_{DR} \cup C_{EE} \cup C_{EN} \cup C_{GR} \cup C_{SP}$ holds. Afterwards, we applied
VEBAV to $C_{\text{PAN-Train}}$, using as input all feature sets listed in Table 1 and
the following 14 predefined $\lambda$ values: $\{0.2, 0.4, 0.6, 0.8, 1, 2, \ldots, 10\}$. From the results we constructed
a table in which the rows represent the feature sets $F_1, F_2, \ldots, F_{14}$ and
the columns represent the 14 $\lambda$ values. Next, we derived from this table a row that
contains the median of each column. This row is illustrated in Figure 1. As can
be seen in Figure 1, an optimal range for $\lambda$ is $[0.2, 1]$, where 0.6 seems to be the most
promising value in terms of robustness across all involved feature sets. As a consequence
of this analysis, we decided to use $\lambda = 0.6$ for the further experiments.</p>
      <p>5.3 Experiment II: Determining the classification strength of all feature sets
In this experiment we wanted to compare the classification strength of all involved feature
sets. For this we applied VEBAV with $F_1, F_2, \ldots, F_{14}$ to the six subcorpora in
$C_{\text{PAN-Train}}$, this time using the fixed setting $\lambda = 0.6$. From the resulting table
(rows = feature sets, columns = subcorpora) we calculated for each row (containing
the classification results for all six subcorpora) the median, which gave us a new
column, illustrated in Figure 2. Here, it can be seen that the majority of the feature
sets are more or less equally strong, except for $F_{10}$ and $F_{11}$, which seem to be useless
for VEBAV (at least on $C_{\text{PAN-Train}}$). Furthermore, it can be observed that mixed
feature sets lead to slightly better classification results than the majority of
the non-mixed feature sets.</p>
      <p>To get a better picture of how VEBAV performs without considering the
medians across the six subcorpora, we show in Figure 3 a comparison of the top three
performing feature sets for each individual subcorpus. One interesting observation that
can be drawn from Figure 3 is that the feature set $F_1$ (characters) is almost as strong
as the mixed feature sets, which involve more sophisticated features (such as character
n-grams or tokens). This shows that by using only characters as features, it is possible
to verify authorships with a classification result that is better than a random guess
(50%). Another observation worth mentioning is that the Greek
corpus seems to contain problems that are more difficult to judge than the
problems in the other subcorpora, where the classification results are clearly better.
We believe that this is not a language-based issue, but rather an effort of the organizers
to make the task more challenging, as was also the case in PAN-2013 [2].</p>
      <p>[Figure 3: accuracy (%) of the top three feature sets for each subcorpus (DE, DR, EE, EN, GR, SP).]</p>
      <p>5.4 Experiment III: Single feature sets vs. feature set combinations
In this experiment we wanted to know whether combinations of feature sets, joined by
majority voting, can outperform classifications based on single feature
sets only. As a setup for this experiment, we picked the six most promising feature sets
$\{F_1, F_2, F_5, F_{12}, F_{13}, F_{14}\}$ and used them to construct a power set $\mathcal{P}$, which contains
$2^6 = 64$ feature set combinations. Next, we removed those subsets $F_{comb_i} \in (\mathcal{P} \setminus \emptyset)$
that comprise an even number of feature sets, to enable a fair (non-random)
majority voting and also to speed up the classification process a little by avoiding
unnecessary runs. This led to $2^5 = 32$ suitable combinations $F_{comb_1}, F_{comb_2}, \ldots, F_{comb_{32}}$,
each of which we applied as input for VEBAV on $C_{\text{PAN-Train}}$. Next, we stored
all combinations and their corresponding classification results in a list (sorted in
descending order) and selected the top five combinations, listed in Table 4. Unfortunately,
the comparison between Table 4 and Figure 2 shows that applying
majority voting to feature set combinations gives only negligible improvements (approximately
1 to 2%) in most cases. Focusing on $F_{12}$ in Figure 2, one can even see that its
median accuracy of 62.50% outperforms every feature set combination. One
possible reason for the unsatisfactory results is that many single feature sets
made identical ternary decisions (Yes, No, Unanswered), such that majority
voting is effective only in a few cases. We observed this phenomenon several times when
stepping through the code with the debugger. Hence, we do not consider
feature set combinations in the further evaluations in this paper.</p>
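      <p>The construction of the odd-sized combinations and a simple majority vote can be sketched in Python as follows. The tie rule is an assumption of this sketch: if no decision obtains a strict majority (which can still happen with three possible labels), the vote returns Unanswered. The function name is hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations

feature_sets = ["F1", "F2", "F5", "F12", "F13", "F14"]

# All subsets with an odd number of members (sizes 1, 3 and 5).
odd_combinations = [
    combo
    for size in range(1, len(feature_sets) + 1, 2)
    for combo in combinations(feature_sets, size)
]
print(len(odd_combinations))  # 32


def majority_vote(decisions):
    """Decision shared by more than half of the voters.

    Assumption: without a strict majority the result is 'Unanswered'.
    """
    label, count = Counter(decisions).most_common(1)[0]
    return label if count > len(decisions) / 2 else "Unanswered"


print(majority_vote(["Yes", "No", "Yes"]))  # Yes
```
]]></preformat>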
      <p>5.5 Experiment IV: Obtaining corpus-dependent parameters
Because the classification scores in Experiments I-III were
relatively low, we decided in this experiment to learn individual parameters from each
corpus $C \in \{C_{DE}, C_{DR}, C_{EE}, C_{EN}, C_{GR}, C_{SP}\}$ and thereby to improve the classification results.
For this we first applied VEBAV to each $C$ to obtain individual scores. Since there
were six corpora, we constructed six tables, in which the rows denote $F_1, F_2, \ldots, F_{14}$ and
the columns the 14 $\lambda$ values. Then, we picked the six tuples $(F_i, \lambda_j)$ that led
to the maximum accuracy score in each table. These tuples are listed in Table 5. One
can see here that it definitely makes sense to use corpus-dependent (or more precisely
language/genre-dependent) parameters instead of a global setting. However, the
price for better results may be overfitting.</p>
      <p>5.6 Evaluation Results
In this section we evaluate VEBAV on the test set $C_{\text{PAN-Test}}$, using all relevant
information learned from the prior experiments. For a better overview we divide the
evaluation into three parts: we first show how VEBAV performed with
generalized parameters, then how it performed with individual parameters and finally
how it performed regarding runtime.</p>
      <p>Evaluation results regarding generalized parameters. For the first evaluation we used
as input for VEBAV the generalized parameters $\lambda = 0.6$ and $F_{12}$ that were learned
in Experiments I-II. The results are given in Table 6. As can be observed from this
table, using generalized parameters does not seem to be the best choice, as
optimal results were achieved for only one subcorpus, while for two subcorpora the results
were even lower than a random guess. Moreover, it can be seen that the $\text{AUC} \cdot \text{c@1}$
scores are very low, which is caused not only by the low accuracies themselves, but
also by an inappropriate calculation of $\rho$, performed through linear scaling. However,
further investigations with other scaling methods must show whether this hypothesis is valid.</p>
      <p>Evaluation results regarding individual parameters. In the second evaluation we
used the individual parameters learned in Experiment IV. The results are given in Table
7. Looking at the results in this table, we can conclude once again that using an
individual parameter setting is much more promising than using a global setting. Thus, individual
parameter settings should be (among other things) the subject of further investigations
when applying VEBAV to corpora beyond those of PAN.</p>
      <table-wrap id="tab-7">
        <label>Table 7</label>
        <table>
          <thead>
            <tr><th>$C$</th><th>$(F_i, \lambda_j)$</th><th>Accuracy</th><th>$\text{AUC} \cdot \text{c@1}$</th></tr>
          </thead>
          <tbody>
            <tr><td>$C_{DE}$</td><td>$(F_{12}, 0.6)$</td><td>80%</td><td>0.40248</td></tr>
            <tr><td>$C_{DR}$</td><td>$(F_{12}, 0.8)$</td><td>55%</td><td>0.30569</td></tr>
            <tr><td>$C_{EE}$</td><td>$(F_{12}, 1)$</td><td>55%</td><td>0.25801125</td></tr>
            <tr><td>$C_{EN}$</td><td>$(F_3, 0.6)$</td><td>80%</td><td>0.4146</td></tr>
            <tr><td>$C_{GR}$</td><td>$(F_3, 0.2)$</td><td>70%</td><td>0.362425</td></tr>
            <tr><td>$C_{SP}$</td><td>$(F_7, 1)$</td><td>55%</td><td>0.28072275</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Evaluation results regarding runtime. In the third and last evaluation we applied
VEBAV to the entire PAN corpus $C_{\text{PAN-14}} = C_{\text{PAN-Train}} \cup C_{\text{PAN-Test}}$, where we considered only
the runtime needed for the classification rather than the classification results. For
this evaluation we used a laptop with the following configuration: Intel® Core™
i5-3360M processor (2.80 GHz), 16 GB RAM, 256 GB SSD.</p>
      <p>The runtime (measured in milliseconds) for each feature set and each subcorpus is
given in Table 8. As can be seen in the Median column of Table 8, the fastest
classification was performed with the feature set $F_3$ (punctuation marks). The reason is
that $F_3$ leads to very few feature lookups ($\approx 20$ per document), such that IO accesses
are negligible. In contrast, $F_1$ leads to the highest runtime (apart from $F_{12}$ and
$F_{13}$, which are mixtures of existing sets), because it requires an iteration
over each character in a text. Another interesting observation is that token-based
features ($F_8$ to $F_{11}$) also require very low runtime. This is explained by the fact that there are
far fewer tokens than characters in a text, so the number of lookups is also very
limited.</p>
      <p>The overall runtime for $C_{\text{PAN-14}}$, without considering the subcorpora, is given in
Figure 4. An interesting observation that can be made by comparing Figure 4 to
Figure 2 is that $F_5$ is a suitable candidate when a trade-off between a high
classification result and a low runtime is required. Here, $F_5$ achieves a
classification result of 59.96% while requiring only 2,423 milliseconds. $F_8$ behaves in a
similar way (55.56% accuracy while requiring only 1,830 milliseconds).</p>
      <p>[Figure 4: overall runtime (in milliseconds) on $C_{\text{PAN-14}}$ for each feature set $F_1$ to $F_{14}$.]</p>
    </sec>
    <sec id="sec-4">
      <title>6 Conclusion &amp; future work</title>
      <p>In this paper we presented a simple, scalable and fast authorship verification scheme
for the Author Identification task within the PAN-2014 competition. Our method
offers a number of benefits, for example that it is able to handle (even very short) texts
written in several languages and across different genres. In addition, the method is
independent of linguistic resources such as ontologies, thesauruses and language models.
A further benefit is the low runtime of the method, since there is no need for deep
linguistic processing such as POS tagging, chunking or parsing. Another benefit is that the
components of the method can be replaced easily, for example the
distance function (required for classification), the acceptance threshold or the feature sets
including their parameters. Moreover, the classification itself can be modified easily,
e.g. by using an ensemble of several distance functions.</p>
      <p>Unfortunately, besides these benefits our approach also has several pitfalls. One of the
biggest challenges is the inscrutability of the method's parameter space,
because the number of possible configuration settings is nearly infinite. Such
settings include, for instance, the $\lambda$ parameter of the distance function, the
values of the $n$ and $k$ parameters, and the weight ($\omega$) and tolerance ($\tau$) parameters that
influence the classification quality, but also other options such as $\ell$ (the number of chunks) or
the feature normalization strategy. Due to the complexity of our scheme we could
perform only a small number of experiments to obtain at least an optimal $\lambda$ and the
most promising feature sets. This, however, was done only for the official PAN corpus
and not for other corpora. Thus, it has not been shown whether the learned parameters
perform satisfactorily on other corpora. Another challenge that remains unsolved is
how to optimize the probability score $\rho$, determined in the decision determination step,
as this value also has a strong influence on the resulting $\text{AUC} \cdot \text{c@1}$ scores, which in our
case are very low. Hence, further tests are essential for the applicability of our scheme.</p>
      <p>Acknowledgments. This work was supported by the CASED Center for Advanced
Security Research Darmstadt, Germany, funded by the German state government of
Hesse under the LOEWE program (www.CASED.de).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          :
          <article-title>Comprehensive survey on distance/similarity measures between probability density functions</article-title>
          .
          <source>International Journal of Mathematical Models and Methods in Applied Sciences</source>
          <volume>1</volume>
          (
          <issue>4</issue>
          ),
          <fpage>300</fpage>
          -
          <lpage>307</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN 2013</article-title>
          . In: Forner, P., Navigli, R., Tufis, D. (eds.)
          <source>CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Authorship Verification as a One-Class Classification Problem</article-title>
          . In:
          <source>Proceedings of the Twenty-First International Conference on Machine Learning</source>
          . p.
          <fpage>62</fpage>
          . ICML '04, ACM, New York, NY, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>Mar 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tax</surname>
            ,
            <given-names>D.M.J.</given-names>
          </string-name>
          :
          <article-title>One-Class Classification: Concept Learning in the Absence of Counter-Examples</article-title>
          . Ph.D. thesis
          , Delft University of Technology (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>