VEBAV - A Simple, Scalable and Fast Authorship Verification Scheme
Notebook for PAN at CLEF 2014

Oren Halvani (corresponding author) and Martin Steinebach
Fraunhofer Institute for Secure Information Technology SIT
Rheinstrasse 75, 64295 Darmstadt, Germany
{FirstName.LastName}@SIT.Fraunhofer.de

Abstract. We present VEBAV - a simple, scalable and fast authorship verification scheme for the Author Identification (AI) task within the PAN-2014 competition. VEBAV (VEctor-Based Authorship Verifier), which is a modification of our existing PAN-2013 approach, is an intrinsic one-class verification method based on a simple distance function. VEBAV provides a number of benefits, for instance independence of linguistic resources and tools such as ontologies, thesauruses, language models, dictionaries, spellcheckers, etc. Another benefit is the low runtime of the method, since deep linguistic processing techniques such as POS-tagging, chunking or parsing are not taken into account. A further benefit of VEBAV is its ability to handle more than one language: it can be applied to documents written in Indo-European languages such as Dutch, English, Greek or Spanish. Regarding its configuration, VEBAV can be extended or modified easily by replacing its underlying components, for instance the distance function (required for classification), the acceptance criterion, the underlying features including their parameters, and many more. In our experiments, on a 20% split of the PAN-2014 AI training corpus, we achieved an overall accuracy of 65.83% (in detail: 80% for Dutch-Essays, 55% for Dutch-Reviews, 55% for English-Essays, 80% for English-Novels, 70% for Greek-Articles and 55% for Spanish-Articles).

Keywords: Authorship verification, one-class classification

1 Introduction

Authorship Verification (AV) is a sub-discipline of Authorship Analysis [4, p. 3], which itself is an integral part of digital text forensics. It can be applied in many forensic scenarios, for instance checking the authenticity of written contracts, threats, insults, testaments, etc., where the goal of AV always remains the same: verify whether two documents D_A and D_A? are written by the same author A, or not. An alternative reformulation of the goal is to verify the authorship of D_A?, given a set of sample documents of A. From a machine learning perspective, AV clearly forms a one-class classification problem [3], since A is the only target class to be distinguished among all other possible classes (authors), whose number can in theory be unbounded.

In order to perform AV, at least four components are required:

– The document D_A?, whose alleged authorship should be verified.
– A training set D_A = {D_A^1, D_A^2, ...}, where each D_A^i represents a sample document of A.
– A set of features F = {f_1, f_2, ...}, where each f_j (style marker) should help to model the writing style of D_A? and of each D_A^i ∈ D_A.
– At least one classification method, which accepts or rejects the given authorship based on F and a predefined or dynamically determined threshold θ.

The aim of this paper is to provide a simple, scalable and fast AV scheme, which offers a number of benefits, for instance promising detection rates, easy implementation, low runtime, independence of language and linguistic resources, as well as easy modifiability and expandability.
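To make this setup concrete, the following minimal Python sketch models the four components of a verification problem. It is our own illustration and not part of the original system; all names and the callable-based modelling of feature sets and classifiers are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A feature set F is modelled as a function mapping a text to its style markers.
FeatureSet = Callable[[str], Sequence[str]]

@dataclass
class VerificationProblem:
    known_documents: List[str]   # D_A: sample documents of the known author A
    questioned_document: str     # D_A?: the document whose authorship is in question

def verify(problem: VerificationProblem,
           feature_set: FeatureSet,
           classifier: Callable[[VerificationProblem, FeatureSet, float], str],
           threshold: float) -> str:
    """Return 'Yes', 'No' or 'Unanswered' for the alleged authorship."""
    return classifier(problem, feature_set, threshold)
```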
Our proposed AV scheme, denoted by VEBAV, is based on our earlier approach for the PAN-2013 AI task, which itself was a modification of the Nearest Neighbor (NN) one-class classification technique described by Tax in [5, p. 69]. In a nutshell, VEBAV takes as input a set of sample documents of a known author (D_A) and one document of an unknown author (D_A?). All documents in D_A are first concatenated into one large document, which is then split into (near) equal-sized chunks (such that D_A from then on consists only of these chunks). Afterwards, feature vectors are constructed from each D_A^i ∈ D_A and from D_A? according to preselected feature sets. Next, a representative is selected among the training feature vectors (based on two options), which is important for determining the decision regarding the alleged authorship. In the next step, distances are calculated between the representative and the remaining training feature vectors, as well as between the representative and the test feature vector. Based on all calculated distances, a twofold decision regarding the alleged authorship is generated, which comprises a ternary decision δ ∈ {Yes, No, Unanswered} and a probability score ρ that describes the soundness of δ.

The rest of this paper is structured as follows. In the following section we explain how we extract features from the corresponding linguistic layers and which features have been used in our approach. Afterwards, we describe our verification scheme in Section 3. In Section 4 we present the corpus that was used to evaluate our scheme, while in Section 5 we show our evaluation results. Finally, in Section 6 we draw our conclusions, discuss the challenges we faced, and provide some ideas for future work.

2 Features

Features (style markers) are the core of AV, since they are able to approximate writing styles and thus can help to judge automatically whether two texts have a very similar style. If this holds, it is an indicator that both texts could have been written by the same author. In the next subsections we explain from which sources quantifiable features can be retrieved, which tools are required for this extraction process, and what kind of feature sets we used in our approach.

2.1 Linguistic layers

From a linguistic point of view, features are extracted from so-called linguistic layers, which can be understood as abstract units within a text. In summary, the most important linguistic layers are the following:

– Phoneme layer: This layer includes phoneme-based features, for example vowels, consonants or the more sophisticated supra-segmental or prosodic features. Such features can typically be obtained from texts by using (pronouncing) dictionaries (e.g. IPA).
– Character layer: This layer includes character-based features, for instance prefixes, suffixes or letter n-grams, which are typically extracted from texts via regular expressions.
– Lexical layer: This layer includes token-based features, for instance function words or POS tags (Part-Of-Speech tags). These features can be extracted from texts via tokenizers (which are often based on simple regular expressions).
– Syntactic layer: This layer includes syntax-based features, for instance constituents (e.g. nominal phrases) or collocations. Such features can be extracted by sophisticated regular expressions or by natural language processing tools (e.g. POS taggers). However, the latter are normally bound to a specific language model and thus do not scale to multiple languages. Besides this, the runtime of natural language processing tools is considerably higher than that of pattern matching via regular expressions.
– Semantic layer: This layer includes semantic features, e.g. semantic relations (hyponyms, synonyms, meronyms, etc.). Such features require deep linguistic processing, which often relies on external knowledge resources (e.g. WordNet) or complex tools (e.g. parsers, named entity recognizers, etc.).

In VEBAV we only make use of the character, lexical and syntactic linguistic layers, due to the effectiveness of their underlying features and the low runtime of the feature extraction process, which relies on regular expressions.
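As an illustration of such resource-free, regex-based extraction, the following Python sketch derives a few of the feature types listed in Table 1 below. It is our own example; the concrete regular expressions used inside VEBAV are not given in the paper, so the patterns and function names here are assumptions.

```python
import re
from collections import Counter

def punctuation_marks(text):
    # F3: symbols that structure sentences
    return re.findall(r"[.,:;!?'\"()\-]", text)

def character_ngrams(text, n=3):
    # F6: overlapping character fragments (n = 3 is the paper's default)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def words(text):
    # F9: segmented letter-based units, obtained with a simple regex tokenizer
    return re.findall(r"[^\W\d_]+", text, flags=re.UNICODE)

def word_prefixes(text, k=3):
    # F4: the k starting letters of each word
    return [w[:k] for w in words(text)]

# Example: absolute frequencies of character 3-grams in a short text
frequencies = Counter(character_ngrams("a sample text"))
```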
2.2 Feature sets

In this paper we use the term feature set, denoted by F, to refer to a collection of features belonging to one or more linguistic layers. Table 1 lists the 14 feature sets that have been used in our experiments. Note that F12, F13 and F14 form mixtures of existing feature sets. The idea behind them was to see whether such mixtures can outperform single feature sets when it comes to classification (this was in fact the case, as can be observed in the later experiments in Section 5.3).

Table 1. All feature sets used in our approach.

  Fi   Feature set        Description                        Examples
  F1   Characters         All kinds of characters            {a, b, 1, 8, #, <, %, !, ...}
  F2   Letters            All kinds of letters               {a, b, α, β, ä, ß, ó, á, ñ, ...}
  F3   Punctuation marks  Symbols that structure sentences   {., :, ;, -, ", (, ), ...}
  F4   Word k-prefixes    The k starting letters of words    example → {e, ex, exa, exam, ...}
  F5   Word k-suffixes    The k ending letters of words      example → {e, le, ple, mple, ...}
  F6   Character n-grams  Overlapping character fragments    ex-ample → {ex-, x-a, -am, ...}
  F7   Letter n-grams     Overlapping letter fragments       ex-ample → {exa, xam, amp, ...}
  F8   Tokens             Segmented character-based units    A [sample] text! → {A, [sample], text!}
  F9   Words              Segmented letter-based units       A [sample] text! → {A, sample, text}
  F10  Token n-grams      Overlapping token fragments        Wind and rain! → {Wind and, and rain!}
  F11  Word n-grams       Overlapping word fragments         Wind and rain! → {Wind and, and rain}
  F12  Mix1               A mix of three feature sets        F1 ∪ F3 ∪ F6
  F13  Mix2               A mix of three feature sets        F1 ∪ F2 ∪ F5
  F14  Mix3               A mix of four feature sets         F3 ∪ F4 ∪ F5 ∪ F8

2.3 Parameters

As can be seen in Table 1, six feature sets can be parameterized by n (n-gram size) and/or k (length of prefixes/suffixes). It should be emphasized that adjusting these two parameters can influence the results massively. Hence, we generalized both settings by using constant values (n = k = 3), learned from earlier experiments, which also turned out to be reliable in the current experiments across the diverse PAN subcorpora.

3 Proposed Verification Scheme

In this section we give a detailed description of our AV scheme VEBAV. For the sake of overview we divide the entire workflow of the algorithm into six subsections, where we first explain what kind of preprocessing we perform on the data. The other five subsections focus on the algorithm itself.

3.1 Preprocessing

In contrast to our approach in PAN-2013, in our current approach we decided to apply neither normalization nor noise reduction techniques. Instead, we treat each text as it is, which turned out to be not only less burdensome but also promising. Our only preprocessing mechanism is to concatenate all D_A^i ∈ D_A into a single document D_A which, depending on its length, is split again into ℓ (near) equal-sized chunks D_A^1, D_A^2, ..., D_A^ℓ. Note that in our experiments ℓ is statically set to five chunks if, and only if, the length of D_A is above 15,000 characters; otherwise ℓ is set to three chunks. The reason for this decision is that we observed noticeably better results compared to using a single fixed ℓ for both short and long texts.
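A minimal sketch of this preprocessing step follows. The 15,000-character threshold and the chunk counts of five and three are taken from the description above; splitting at plain character boundaries and all names are our own assumptions.

```python
def build_chunks(known_documents, char_threshold=15_000):
    """Concatenate all sample documents of A and split the result into
    l (near) equal-sized chunks, where l depends on the total length."""
    concatenated = "".join(known_documents)
    num_chunks = 5 if len(concatenated) > char_threshold else 3
    chunk_size = -(-len(concatenated) // num_chunks)  # ceiling division
    return [concatenated[i:i + chunk_size]
            for i in range(0, len(concatenated), chunk_size)]

# Example: two short sample documents yield three chunks
chunks = build_chunks(["First sample document. ", "Second sample document."])
assert len(chunks) == 3
```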
3.2 Vocabulary Generation

In order to form a basis for the construction of feature vectors, we need to build a global feature vocabulary, denoted by V. Beforehand, however, we first need to select at least one (appropriate) feature set F, to know which sort of features should be taken into account. Here, appropriate refers to features that...

– ... are able to model high similarity between D_A? and D_A (for the case A? = A).
– ... are able to discriminate well between the writing style of A and all other possible authors (for the case A? ≠ A).
– ... appear in all generated vocabularies.

Afterwards, we construct from D_A? and from each chunk D_A^i ∈ D_A the corresponding feature vocabularies V_{D_A?} and V_{D_A^1}, V_{D_A^2}, ..., V_{D_A^ℓ}. As a last step we intersect all constructed vocabularies to build the global vocabulary V:

  V = V_{D_A?} ∩ V_{D_A^1} ∩ V_{D_A^2} ∩ ... ∩ V_{D_A^ℓ}    (1)

Note that in VEBAV we consider case-sensitive features, such that word ∈ V and Word ∈ V are both valid.

3.3 Constructing feature vectors

Once V is generated, the next step is to construct the feature vectors F_A^1, F_A^2, ..., F_A^ℓ from each D_A^i ∈ D_A and F_A? from D_A?. The construction process for these vectors is very straightforward. For each f_i ∈ V we look up how often it occurs in a given document D and denote its absolute frequency by α_i. Next, we normalize α_i by the length of D to obtain its relative frequency, a number x_i ∈ [0; 1]. Hence, the feature vector representation of D is formally defined by:

  F = (x_1, x_2, ..., x_n), with n = |V|.    (2)

3.4 Representative Selection

After constructing all feature vectors, a representative (training) feature vector F_rep must be selected. This step is essential for the later determination of the decision regarding the alleged authorship. In general, VEBAV offers two options to select F_rep:

1. Selecting F_rep dynamically by using a similarity function (e.g. cosine similarity) between all training feature vectors. Here, F_rep is the feature vector that is most dissimilar from the others; in other terms, F_rep can be understood as an outlier.
2. Selecting F_rep manually (statically).

Note that in our experiments we chose the first option.

3.5 Distance Calculations

In this step we calculate the distances needed for determining the decision regarding the alleged authorship. Concretely, we calculate the distances d_1, d_2, ..., d_ℓ between F_rep and each F_A^1, F_A^2, ..., F_A^ℓ, as well as the distance d_? between F_rep and F_A?. The calculation of these distances requires a predefined distance function dist(X, Y) with X = F_rep and Y ∈ {F_A^1, F_A^2, ..., F_A^ℓ}. We implemented a broad range of distance functions in VEBAV, the majority of which have been taken from [1]. However, for complexity reasons we used only the Minkowski distance function in our experiments, which is formally defined as:

  dist(X, Y) = ( Σ_{i=1}^{n} |x_i − y_i|^λ )^{1/λ},  with λ ∈ R⁺ \ {0}.    (3)

As a last step we calculate the average of the distances between F_rep and each F_A^1, F_A^2, ..., F_A^ℓ as follows:

  d_∅ = (1/ℓ) · (d_1 + d_2 + ... + d_ℓ)    (4)
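To make Sections 3.3-3.5 concrete, the sketch below builds relative-frequency vectors over the global vocabulary, selects the most dissimilar training vector as representative via cosine similarity, and computes Minkowski distances. It is our own minimal reconstruction under the stated definitions, not the original implementation, and all names are ours.

```python
import math
from collections import Counter

def feature_vector(features, vocabulary, text_length):
    """Relative frequencies (Eq. 2) of the vocabulary entries in one document."""
    counts = Counter(features)
    return [counts[f] / text_length for f in vocabulary]

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def select_representative(training_vectors):
    """Pick the training vector that is most dissimilar (lowest summed cosine
    similarity) to the remaining ones, i.e. the 'outlier' of Section 3.4."""
    def summed_similarity(i):
        return sum(cosine_similarity(training_vectors[i], v)
                   for j, v in enumerate(training_vectors) if j != i)
    return min(range(len(training_vectors)), key=summed_similarity)

def minkowski(x, y, lam=0.6):
    """Minkowski distance (Eq. 3) with parameter lambda."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)
```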
3.6 Decision Determination

The goal of this step is to construct a twofold decision regarding the alleged authorship, which consists of a ternary decision δ ∈ {Yes, No, Unanswered} and a probability score ρ. In order to calculate these terms, both values d_? and d_∅ are required. For the latter we use the following adjusted form: d′ = d_∅ + (ω · τ), where ω denotes a weight and τ a tolerance parameter, calculated from the standard deviation of d_1, d_2, ..., d_ℓ. The idea behind ω and τ is to cope with the presence of noisy writing styles, which might be the result of mixing different sample documents of A together in the training set D_A. Note that in our experiments we chose a neutral weight (ω = 1), since we could not determine an optimal value for it (even adjusting it by 0.1 can cause a massive drop in the classification results). With these, we first define the probability score as:

  ρ = 1 / (1 + d_? / d′)    (5)

Next, we define the ternary decision as follows:

  δ = Yes,         if d_? < d′
      No,          if d_? > d′
      Unanswered,  if (d_? = d′) ∨ (|d_? − d′| < ε)    (6)

The concrete semantics behind δ is:

– Yes: VEBAV accepts the alleged author as the true author (A? = A).
– No: VEBAV rejects the alleged author as the true author (A? ≠ A).
– Unanswered: VEBAV is unable to generate a prediction because d_? and d′ are equal or near-equal, or due to another unexpected result. In this case ρ is set to 0.5.

Note that depending on how ε is chosen (for the case that d_? and d′ are near-equal), the number of Unanswered decisions in a corpus evaluation can vary considerably. In our experiments we chose ε = 0.001, as this small value restricts the Unanswered decisions to cases where ρ is near 0.5.
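A minimal sketch of this decision step (Eqs. 5 and 6) follows; it is our own illustration. The neutral weight and ε = 0.001 are taken from above, the use of the population standard deviation for τ is an assumption, and all function and variable names are ours.

```python
import statistics

def decide(d_star, training_distances, weight=1.0, eps=0.001):
    """Twofold decision: ternary answer delta and probability score rho."""
    d_avg = sum(training_distances) / len(training_distances)
    tau = statistics.pstdev(training_distances)   # tolerance parameter
    d_prime = d_avg + weight * tau                # adjusted average distance

    if d_prime == 0 or abs(d_star - d_prime) < eps:
        return "Unanswered", 0.5
    rho = 1.0 / (1.0 + d_star / d_prime)          # Eq. (5)
    delta = "Yes" if d_star < d_prime else "No"   # Eq. (6)
    return delta, rho
```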
4 Used Corpora

For our experiments we used the official PAN-2014 Author Identification training corpus, denoted by C_PAN-14, which was released by the PAN organizers on April 22, 2014 (the corpus can be downloaded from http://pan.webis.de, accessed on June 26, 2014). C_PAN-14 consists of 695 problems (2,382 documents in total), equally distributed regarding true/false authorships. A problem p_i forms a tuple (D_Ai, D_Ai?), where D_Ai denotes the training set of A_i and D_Ai? the questioned document, which is (or is not) written by A_i. Each problem belongs to one of four languages (Dutch, English, Greek and Spanish) and to one of four genres (Essays, Reviews, Novels and Articles). For simplification reasons, C_PAN-14 is divided into six subcorpora and thus can be formulated as C_PAN-14 = {C_DE, C_DR, C_EE, C_EN, C_GR, C_SP}. This makes it easier to treat each subcorpus independently (e.g. in terms of parameterizations). The full name of each C ∈ C_PAN-14 is given in Table 2.

Table 2. Full names of all subcorpora within the C_PAN-14 corpus.

  C_DE: Dutch-Essays    C_EE: English-Essays   C_GR: Greek-Articles
  C_DR: Dutch-Reviews   C_EN: English-Novels   C_SP: Spanish-Articles

For our experiments we used an 80% portion of C_PAN-14 for training/parameter learning (denoted by C_PAN-Train), while the remaining 20% portion was used for testing (denoted by C_PAN-Test). C_PAN-Test follows the same structure as C_PAN-14, such that C_PAN-Test = {C_DE, C_DR, C_EE, C_EN, C_GR, C_SP} holds, where the number of problems in each C ∈ C_PAN-Test equals 20% of the problems in the corresponding C ∈ C_PAN-14. The concrete statistics, including the distribution of true/false authorships, are given in Table 3.

Table 3. Statistics of C_PAN-Test.

  C      |C|   Distribution of true/false authorships
  C_DE   20    9 true cases / 11 false cases
  C_DR   20    11 true cases / 9 false cases
  C_EE   40    2 true cases / 38 false cases
  C_EN   20    10 true cases / 10 false cases
  C_GR   20    11 true cases / 9 false cases
  C_SP   20    10 true cases / 10 false cases

5 Evaluation

In this section we carry out our evaluation on the corpus C_PAN-14 = C_PAN-Train ∪ C_PAN-Test. First we explain which performance measures were used, then how the most important parameters were learned from C_PAN-Train. Finally, we evaluate our approach on C_PAN-Test.

5.1 Performance measures

In order to evaluate our approach, we used several performance measures, sharing the following variables:

  m   = number of problems in C              (7)
  m_c = number of correct answers for C      (8)
  m_u = number of unanswered problems for C  (9)

The first performance measure is Accuracy = m_c / m, whereas for the second performance measure we first need to define two terms:

  AUC = (1/m) · Σ_{i=1}^{m} ρ_i,   c@1 = (1/m) · (m_c + m_u · (m_c / m))    (10)

Here, ρ_i denotes the probability score (defined in Section 3.6) of the corresponding problem p_i (defined in Section 4). With these two terms we define the second performance measure: AUC · c@1. Note that for parameter learning from the C_PAN-Train corpus we only used the Accuracy measure, as (in our opinion) this measure is easier to interpret.
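The following sketch computes these measures for one subcorpus, given the per-problem decisions and probability scores; it is our own illustration, and the names are ours. Note that the AUC term here is simply the mean of the probability scores, exactly as defined in Eq. (10), rather than a ROC-based area under the curve.

```python
def evaluate(decisions, rho_scores, truth):
    """Accuracy, c@1 and AUC·c@1 for one subcorpus, following Eqs. (7)-(10).

    decisions  -- list of 'Yes' / 'No' / 'Unanswered' answers
    rho_scores -- list of the probability scores rho_i
    truth      -- list of 'Yes' / 'No' gold answers
    """
    m = len(truth)
    m_c = sum(1 for d, t in zip(decisions, truth) if d == t)
    m_u = decisions.count("Unanswered")

    accuracy = m_c / m
    auc = sum(rho_scores) / m                 # Eq. (10), left term
    c_at_1 = (m_c + m_u * (m_c / m)) / m      # Eq. (10), right term
    return accuracy, auc * c_at_1
```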
5.2 Experiment I: Finding an optimal λ for the Minkowski distance function

The intention behind this experiment was to find an optimal λ parameter for the Minkowski distance function. Since λ has a strong influence on the classification result, it must be well-chosen in order to generalize across all involved corpora and all feature sets. To achieve this generalization we merged all training subcorpora, such that C_PAN-Train = C_DE ∪ C_DR ∪ C_EE ∪ C_EN ∪ C_GR ∪ C_SP holds. Afterwards, we applied VEBAV to C_PAN-Train, using as input all feature sets listed in Table 1 and the following 14 predefined λ values: {0.2, 0.4, 0.6, 0.8, 1, 2, ..., 10}. From the results we constructed a table in which the rows represent the feature sets F1, F2, ..., F14 and the columns the 14 λ values. Next, we derived from this table a row containing the medians of all columns. This row is illustrated in Figure 1.

[Figure 1. Comparison of different λ values for the Minkowski distance function (median accuracies over all feature sets; the two best medians are 56.42 and 56.25).]

As can be seen in Figure 1, an optimal range for λ is [0.2; 1], where 0.6 seems to be the most promising value in terms of robustness among all involved feature sets. As a consequence of this analysis, we decided to use λ = 0.6 for the further experiments.

5.3 Experiment II: Determining the classification strength of all feature sets

In this experiment we wanted to compare the classification strength of all involved feature sets. For this we applied VEBAV to F1, F2, ..., F14 on the six subcorpora in C_PAN-Train, this time using the fixed setting λ = 0.6. From the resulting table (rows = feature sets, columns = subcorpora) we calculated for each row (containing the classification results for all six subcorpora) the median, which gave us a new column, illustrated in Figure 2. Here, it can be seen that the majority of the feature sets are more or less equally strong, except F10 and F11, which seem to be useless for VEBAV (at least for C_PAN-Train). Furthermore, it can be observed that using mixed feature sets leads to slightly better classification results compared with the majority of the non-mixed feature sets.

[Figure 2. Comparison of the 14 feature sets across all training corpora (using λ = 0.6). Median accuracies: F1 60.07, F2 58.98, F3 52.19, F4 52.57, F5 59.96, F6 56.88, F7 56.25, F8 56.56, F9 56.25, F10 13.39, F11 15.36, F12 62.50, F13 59.44, F14 61.55.]

In order to get a better picture of how VEBAV performs without considering the medians over the six subcorpora, we show in Figure 3 a comparison of the top three performing feature sets for each individual subcorpus. One interesting observation that can be drawn from Figure 3 is that the feature set F1 (characters) is almost as strong as the mixed feature sets, which involve more sophisticated features (such as character n-grams or tokens). This shows that by using only characters as features, it is possible to verify authorships with a classification result that is better than a random guess (50%). Another observation worth mentioning is that the Greek corpus seems to contain problems that are more difficult to judge than the problems in the other subcorpora, where the classification results are clearly better.

[Figure 3. Comparison of the top three performing feature sets (F1, F12, F14) for each individual subcorpus (DE, DR, EE, EN, GR, SP), without applying medians over the subcorpora.]

We believe that this is not a language-based issue, but rather an effort by the organizers to make the task more challenging, as was also the case in PAN-2013 [2].

5.4 Experiment III: Single feature sets vs. feature set combinations

In this experiment we were curious to know whether combinations of feature sets, combined via majority voting, can outperform classifications based only on single feature sets. As a setup for this experiment, we picked the six most promising feature sets {F1, F2, F5, F12, F13, F14} and used them to construct a power set P, which includes 2^6 = 64 feature set combinations. Next, we removed those subsets F_comb_i ∈ (P \ ∅) comprising an even number of feature sets, to enable a fair (non-random) majority voting and also to speed up the classification process a little by avoiding unnecessary runs. This led to 2^5 = 32 suitable combinations F_comb_1, F_comb_2, ..., F_comb_32, where we applied each F_comb as input for VEBAV on C_PAN-Train. Next, we stored all combinations and their corresponding classification results in a list (sorted in descending order) and selected the top five combinations, listed in Table 4.

Table 4. Top five performing feature set combinations.

  F_comb                       Accuracy
  {F1, F2, F5, F12, F14}       61.8%
  {F3, F12, F13}               61.62%
  {F1, F2, F5, F13, F14}       61.62%
  {F2, F5, F12, F13, F14}      61.26%
  {F1, F3, F12}                61.26%

Unfortunately, the comparison between Table 4 and Figure 2 shows that applying majority voting to feature set combinations gives only negligible improvements (≈ 1-2%) in most cases. When focusing on F12 in Figure 2, it can even be seen that its median accuracy of 62.50% outperforms any of the feature set combinations. One possible reason for the unsatisfactory results could be that many single feature sets made identical ternary decisions (Yes, No, Unanswered), such that applying majority voting is only effective in a few cases. We observed this phenomenon several times by stepping through the code with the debugger. Hence, we do not consider the usage of feature set combinations for further evaluations in this paper.
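For illustration, the combination generation and the voting step of this experiment can be sketched as follows; this is our own reconstruction, and the helper names are assumptions.

```python
from itertools import combinations

base_sets = ["F1", "F2", "F5", "F12", "F13", "F14"]

# All subsets with an odd number of members (2^5 = 32 combinations),
# so that a Yes/No majority vote cannot be decided by a coin flip.
odd_combinations = [c for r in range(1, len(base_sets) + 1, 2)
                    for c in combinations(base_sets, r)]
assert len(odd_combinations) == 32

def majority_vote(decisions):
    """Combine the ternary decisions of the single feature sets in one combination."""
    yes, no = decisions.count("Yes"), decisions.count("No")
    if yes > no:
        return "Yes"
    if no > yes:
        return "No"
    return "Unanswered"
```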
5.5 Experiment IV: Obtaining corpus-dependent parameters

Since the classification scores in Experiments I-III were relatively low, we decided in this experiment to learn individual parameters from each corpus C ∈ {C_DE, C_DR, C_EE, C_EN, C_GR, C_SP} and to thereby improve the classification results. For this we first applied VEBAV to each C to obtain individual λ scores. Since there were six corpora, we constructed six tables, where the rows denote F1, F2, ..., F14 and the columns the 14 λ values. Then we picked the six tuples (F_i, λ_j) which led to the maximum accuracy score in each table. These tuples are listed in Table 5.

Table 5. Most promising tuple for each individual training subcorpus.

  C      (F_i, λ_j)    Accuracy
  C_DE   (F12, 0.6)    73.68%
  C_DR   (F12, 0.8)    60.76%
  C_EE   (F12, 1)      63.75%
  C_EN   (F3, 0.6)     75%
  C_GR   (F3, 0.2)     65%
  C_SP   (F7, 1)       71.25%

One can see here that it definitely makes sense to use corpus-dependent (or, more precisely, language/genre-dependent) parameters instead of a global setting. However, the better results may come at the price of overfitting.
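Experiment IV is essentially a per-corpus grid search over (feature set, λ) pairs. The sketch below shows this search loop, assuming a callable run_vebav(corpus, feature_set, lam) that returns the accuracy achieved with a given configuration; this callable and all other names are hypothetical.

```python
def best_setting_for_corpus(corpus, feature_sets, lambdas, run_vebav):
    """Return the (feature set, lambda, accuracy) triple with the highest
    accuracy on one training subcorpus."""
    best = None
    for fs in feature_sets:
        for lam in lambdas:
            accuracy = run_vebav(corpus, fs, lam)
            if best is None or accuracy > best[2]:
                best = (fs, lam, accuracy)
    return best

# The 14 lambda values used in Experiments I and IV
lambdas = [0.2, 0.4, 0.6, 0.8] + list(range(1, 11))
```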
5.6 Evaluation results for the test set

In this section we evaluate VEBAV on the test set C_PAN-Test, using all relevant information learned from the prior experiments. For a better overview we divide the evaluation into three parts, where we first show how VEBAV performed with generalized parameters, then how it performed with individual parameters, and finally how it performed regarding runtime.

Evaluation results regarding generalized parameters. For the first evaluation we set as input for VEBAV the generalized parameters λ = 0.6 and F12, learned from Experiments I-II. The results are given in Table 6. As can be observed from this table, using generalized parameters does not seem to be the best choice, as achieving optimal results was possible for only one subcorpus, while for two subcorpora the results were even lower than a random guess. Moreover, it can be seen that the AUC·c@1 scores are very low, which is caused not only by the low accuracies themselves, but also by an inappropriate calculation of ρ, performed through linear scaling. However, further investigations with other scaling methods must show whether this hypothesis is valid.

Table 6. Results on C_PAN-Test, using generalized parameters.

  C      Accuracy   AUC·c@1
  C_DE   80%        0.40248
  C_DR   45%        0.2445525
  C_EE   65%        0.3147625
  C_EN   45%        0.2309625
  C_GR   50%        0.262025
  C_SP   55%        0.2602875

Evaluation results regarding individual parameters. In the second evaluation we used the individual parameters learned in Experiment IV. The results are given in Table 7. Looking at the results in this table, we can conclude once again that an individual parameter setting is much more promising than a global setting. Thus, individual parameter settings should be (among other things) the subject of further investigations when applying VEBAV to other corpora beyond those of PAN.

Table 7. Results on C_PAN-Test, using individual parameters.

  C      (F_i, λ_j)    Accuracy   AUC·c@1
  C_DE   (F12, 0.6)    80%        0.40248
  C_DR   (F12, 0.8)    55%        0.30569
  C_EE   (F12, 1)      55%        0.25801125
  C_EN   (F3, 0.6)     80%        0.4146
  C_GR   (F3, 0.2)     70%        0.362425
  C_SP   (F7, 1)       55%        0.28072275

Evaluation results regarding runtime. In the third and last evaluation we applied VEBAV to the entire PAN corpus C_PAN-14 = C_PAN-Train ∪ C_PAN-Test, where we considered only the runtime needed for the classification rather than the classification results. For this evaluation we used a laptop with the following configuration: Intel Core i5-3360M processor (2.80 GHz), 16 GB RAM, 256 GB SSD.

The runtime (measured in milliseconds) for each feature set and each subcorpus is given in Table 8. As can be seen in the Median column of Table 8, the fastest classification was performed with the feature set F3 (punctuation marks). The reason for this is that F3 leads to very few feature lookups (≈ 20 per document), such that IO accesses are negligible. In contrast, F1 leads to the highest runtime (except for F12 and F13, since they are mixtures of existing sets), due to the fact that it requires an iteration over each character in a text. Another interesting observation is that token-based features (F8-F11) also require very little runtime. This is explained by the fact that there are far fewer tokens than characters in a text and thus the number of lookups is also very limited.

Table 8. Runtime (in milliseconds) needed by VEBAV for the classification of C_PAN-14, per subcorpus.

          C_DE    C_DR   C_EE     C_EN     C_GR     C_SP     Median
  F1      1,015   148    5,361    6,358    7,279    7,395    5,860
  F2      775     113    4,083    5,891    5,761    5,819    4,922
  F3      44      7      235      359      150      307      193
  F4      55      6      358      623      445      502      402
  F5      98      6      396      675      470      569      433
  F6      356     17     2,083    2,963    3,019    4,457    2,523
  F7      532     29     3,349    4,612    4,512    6,942    3,931
  F8      67      8      402      428      371      530      387
  F9      70      8      449      478      438      582      444
  F10     53      10     253      283      253      328      253
  F11     95      18     458      521      481      583      470
  F12     1,529   178    8,505    11,554   12,383   14,821   10,030
  F13     2,035   287    11,082   13,476   16,369   15,817   12,279
  F14     269     26     1,698    2,255    1,856    2,280    1,777
  Median  184     18     1,078    1,465    1,169    1,432

The overall runtime for C_PAN-14, without considering the subcorpora, is given in Figure 4.

[Figure 4. Runtime needed by VEBAV for the classification of C_PAN-14 (without considering the subcorpora), per feature set F1-F14.]

An interesting observation that can be made by comparing Figure 4 to Figure 2 is that F5 can be seen as an optimal candidate when a trade-off between a high classification result and a low runtime must be taken into account: F5 achieves a classification result of 59.96% while requiring only 2,423 milliseconds. F8 behaves in a similar way (55.56% accuracy while requiring only 1,830 milliseconds).

6 Conclusion & future work

In this paper we presented a simple, scalable and fast authorship verification scheme for the Author Identification task within the PAN-2014 competition. Our method provides a number of benefits, for example it is able to handle (even very short) texts written in several languages, across different kinds of genres. Besides this, the method is independent of linguistic resources such as ontologies, thesauruses, language models, etc. A further benefit is the low runtime of the method, since there is no need for deep linguistic processing like POS-tagging, chunking or parsing. Another benefit is that the components involved in the method can be replaced easily, for example the distance function (required for classification), the acceptance threshold or the feature sets including their parameters.
Moreover, the classification itself can be modified easily, e.g. by using an ensemble of several distance functions.

Besides these benefits, however, our approach also has several pitfalls. One of the biggest challenges, for example, is the inscrutability of the method's parameter space, since the number of possible configuration settings is nearly infinite. Such settings include, for instance, the λ parameter of the involved distance function, the values of the n and k parameters, the weight (ω) and tolerance (τ) parameters that influence the classification quality, but also other options such as ℓ (the number of chunks) or the feature normalization strategy. Due to the complexity of our scheme we could only perform a small number of experiments to obtain at least an optimal λ and the most promising feature sets. This, however, was done only for the official PAN corpus and not for other corpora. Thus, it has not been shown whether the learned parameters perform satisfactorily on other corpora. Another challenge that remains unsolved is how to optimize the probability score ρ, determined in the decision determination step, as this value also has a strong influence on the resulting AUC·c@1 scores, which in our case are very low. Hence, further tests are essential for the applicability of our scheme.

Acknowledgments. This work was supported by CASED (Center for Advanced Security Research Darmstadt, Germany), funded by the German state government of Hesse under the LOEWE program (www.CASED.de).

References

1. Cha, S.H.: Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences 1(4), 300-307 (2007)
2. Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers (2013)
3. Koppel, M., Schler, J.: Authorship Verification as a One-Class Classification Problem. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), p. 62. ACM, New York, NY, USA (2004)
4. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60(3), 538-556 (2009)
5. Tax, D.M.J.: One-Class Classification: Concept Learning in the Absence of Counter-Examples. Ph.D. thesis, Delft University of Technology (2001)