        Overview of the Author Identification Task
                      at PAN 2014

        Efstathios Stamatatos1, Walter Daelemans2, Ben Verhoeven2, Martin Potthast3,
                  Benno Stein3, Patrick Juola4, Miguel A. Sanchez-Perez5,
                                and Alberto Barrón-Cedeño6
                          1 University of the Aegean, Greece
                          2 University of Antwerp, Belgium
                          3 Bauhaus-Universität Weimar, Germany
                          4 Duquesne University, USA
                          5 Instituto Politécnico Nacional, Mexico
                          6 Universitat Politècnica de Catalunya, Spain



         Abstract. The author identification task at PAN-2014 focuses on author
         verification. Similar to PAN-2013, we are given a set of documents by the same
         author along with exactly one document of questioned authorship, and the task
         is to determine whether the known and the questioned documents are by the
         same author or not. In comparison to PAN-2013, a significantly larger corpus
         was built comprising hundreds of documents in four natural languages (Dutch,
         English, Greek, and Spanish) and four genres (essays, reviews, novels, opinion
         articles). In addition, more suitable performance measures are used, focusing on
         the accuracy and the confidence of the predictions as well as the ability of the
         submitted methods to leave some problems unanswered in cases of great
         uncertainty. To this end, we adopt the c@1 measure, originally proposed for the
         question answering task. We received 13 software submissions that were
         evaluated in the TIRA framework. Detailed evaluation results are presented, with
         one language-independent approach serving as a challenging baseline.
         Moreover, we continue the successful practice of the PAN labs to examine
         meta-models based on the combination of all submitted systems. Last but not
         least, we provide statistical significance tests to demonstrate the important
         differences between the submitted approaches.



1       Introduction

Authorship analysis has attracted much attention in recent years due to both the rapid
increase of texts in electronic form and the need for intelligent systems able to handle
this information. Authorship analysis deals with the personal style of authors and
includes three major areas:
    -   Author identification: Given a set of candidate authors for whom some texts of
        undisputed authorship exist, attribute texts of unknown authorship to one of the
        candidates. This is mainly applied in forensic investigations and literary
        analysis [13, 31].




 -   Author profiling: The extraction of demographic information such as gender,
     age, etc. about the authors. This has significant applications mainly in market
     analysis [28].
 -   Author clustering: The segmentation of texts into stylistically homogeneous
     parts. This can be applied to distinguish different authors in collaborative
     writing, to detect plagiarism without a reference corpus (i.e., intrinsic plagiarism
     detection [35]), and to detect changes in the personal style of a certain author by
     examining their works chronologically [14].
   Author identification is by far the most prevalent field of authorship analysis in
terms of published studies. The authorship attribution problem can be viewed as a
closed-set classification task where all possible candidate authors are known. This is
suitable in many forensic applications where the investigators of a case can provide a
specific set of suspects based on certain restrictions (e.g., access to specific material,
knowledge of specific facts, etc.). A more general definition of the authorship
attribution problem corresponds to an open-set classification task where the true
author of the disputed texts is not necessarily included in the set of candidate authors.
This setting is much more difficult in comparison to the closed-set attribution
scenario, especially when the size of the candidate author set is small [18]. Finally,
when the set of candidate authors is a singleton, we get the author verification problem,
an even more difficult attribution task.
   The PAN-2014 evaluation lab continues the practice of PAN-2013 and focuses on
the author verification problem [15]. First, this is a fundamental problem in
authorship attribution [20] and by studying it we can extract more useful conclusions
about the performance of certain attribution methods. Any author identification task
can be decomposed into a series of author verification problems. Therefore, the ability
of an approach to effectively deal with this task means that it can cope with every
authorship attribution problem. Moreover, in comparison to PAN-2013, we provide a
larger collection of verification problems including more natural languages and
genres. Thus, we can study more reliably the performance of the submitted
approaches under different conditions and test their ability to be adapted to certain
properties of documents. In addition, we define more appropriate performance
measures that are suitable for this cost-sensitive task focusing on the ability of the
submitted approaches to assign confidence scores in their answers as well as their
ability to leave the most uncertain cases unanswered.
   Based on the successful practice of PAN-2013, we build a meta-classifier to
combine all submitted approaches and examine the performance of this ensemble
model in comparison to the individual participants [15]. Moreover, we use one
effective model submitted to PAN-2013 as a baseline method. This enables us to have
a more challenging baseline (in comparison to random guessing) that reflects and can be
adapted to the difficulty of a certain corpus. Finally, we provide tests of statistical
significance to examine whether there are important differences in the performance of
the submitted methods, the baseline, and the meta-classifier.
   In the remainder of this paper, Section 2 reviews previous work in author
verification, Section 3 describes in detail the evaluation setup used at PAN-2014,
and Section 4 presents the evaluation results. A review of the submitted




approaches is included in Section 5 and Section 6 summarizes the main conclusions
that can be drawn and discusses future work directions.


2    Relevant Work

The author verification problem was first discussed in [32]. Based on a corpus of
newspaper articles in Greek, the authors used multiple regression to produce a response
function for a given author and a threshold value to determine whether or not a
questioned document was by that author. False acceptance and false rejection rates
were used to evaluate this model. The same metrics were used by [37] to evaluate an
authorship verification method based on a rich set of linguistic features.
   Perhaps the best-known approach for author verification, the unmasking method,
was introduced in [19]. The main idea is to build an SVM classifier to distinguish the
questioned document from the set of known documents, then to remove the most
important features and repeat this process. In case the questioned and known
documents are by the same author, the accuracy of the classifier significantly drops
after a small number of repetitions while it remains relatively high when they are not
by the same author. Accuracy and F1 were used to evaluate this method, which was very
effective for long documents but fails when documents are relatively short [33].
Modifications and additional evaluation tests for the unmasking method can be found
in [34] and [16].
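   For illustration, a minimal sketch of the unmasking procedure is given below. It assumes
scikit-learn, a simple bag-of-words representation, and arbitrary chunking and elimination
parameters; it is an outline of the technique rather than the exact configuration of [19].

```python
# Unmasking sketch: repeatedly train a linear classifier to separate chunks of
# the known text from chunks of the questioned text, then drop the most
# discriminative features and retrain. A steep drop in cross-validation
# accuracy suggests the two texts are by the same author.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def chunks(text, size=500):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - size + 1, size)]

def unmasking_curve(known_text, questioned_text, iterations=10, drop_per_iter=6):
    known, questioned = chunks(known_text), chunks(questioned_text)
    y = np.array([0] * len(known) + [1] * len(questioned))
    vec = CountVectorizer(max_features=250)            # most frequent words only
    X = vec.fit_transform(known + questioned).toarray().astype(float)
    active = np.arange(X.shape[1])                     # indices of surviving features
    curve = []
    for _ in range(iterations):
        clf = LinearSVC(max_iter=10000)
        curve.append(cross_val_score(clf, X[:, active], y, cv=3).mean())
        clf.fit(X[:, active], y)
        weights = np.abs(clf.coef_[0])
        active = active[np.argsort(weights)[:-drop_per_iter]]  # remove strongest features
    return curve                                       # accuracy degradation curve
```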
   Luyckx and Daelemans approximated the author verification problem as a binary
classification task by considering all available texts by other authors as negative
examples [22]. They used recall, precision, and F1 to evaluate their approach in a
corpus of student essays in Dutch. Escalante et al. applied particle swarm model
selection to select a suitable classifier for a given author [5]. They used F1 and
balanced error rate (the average of error rates for positive and negative class) to
evaluate their approach on two corpora of English newswire stories and Spanish
poems. More recently, Koppel and Winter proposed an effective method that attempts
to transform authorship verification from a one-class classification task to a multi-
class classification problem by introducing additional authors, the so-called
impostors, using documents found in external sources (e.g., the Web) [20]. Accuracy
and recall-precision graphs were used to evaluate this method.
   Author verification was included in previous editions of the PAN evaluation lab.
The author identification task at PAN-2011 [1] included 3 author verification
problems, each comprising a number of texts (i.e., email messages) of known
authorship, all by the same author, and a number of questioned texts (either by the
author of the known texts or not). Performance was measured by macro-average
precision, recall and F1. PAN-2013 was exclusively focused on the author verification
problem [15]. New training and evaluation corpora were built for three languages (i.e.,
English, Greek, and Spanish) where each verification problem included at most 10
documents by the same author and exactly one questioned document. Beyond a binary
answer for each verification problem, the participants could also produce (optionally)
a probability-like score to indicate the confidence of a positive answer. Recall,
precision, F1 and ROC graphs were used to evaluate the performance of the 18




participants. Moreover, a simple meta-model combining all the submitted methods
achieved the best overall performance. For the first time, software submissions were
requested at PAN-2013 enabling reproducibility of the results and future evaluation
on different corpora.


3     Evaluation Setup

PAN-2014 focuses on author verification, similar to PAN-2013. Given a set of known
documents all written by the same author and exactly one questioned document, the
task is to determine whether the questioned document was written by that particular
author or not. Similar to the corresponding task at PAN-2013, best efforts were made
to ensure that all known and questioned documents within a problem are
matched for genre, register, theme, and date of writing. In contrast to PAN-2013, the
number of known documents is limited to at most 5, while a greater variety of
languages and genres is covered. The text length of documents varies from a few
hundred to a few thousand words, depending on the genre.
   The participants were asked to submit their software, which takes the language and
genre of the documents as input parameters. For each verification problem, they had to
provide a score, a real number in [0,1], corresponding to the probability
of a positive answer (i.e., the known and the questioned documents are by the same
author). In case the participants wanted to leave some verification problems
unanswered, they could assign a probability score of exactly 0.5 to those problems.


3.1   Corpus

The PAN-2014 corpus comprises author verification problems in four languages:
Dutch, English, Greek, and Spanish. For Dutch and English there are two genres in
separate parts of the corpus. An overview of the training and evaluation corpus of the
author identification task is shown in Table 1. As can be seen, beyond language and
genre, the corpora vary in the number of known texts per problem and in text length.
The size of both
training and evaluation corpora is significantly larger than the corresponding corpora
of PAN-2013. All corpora in both training and evaluation sets are balanced with
respect to the number of positive and negative examples.
   The Dutch corpus is a transformed version of the CLiPS Stylometry Investigation
(CSI) corpus [38]. This recently released corpus contains documents from two genres:
essays and reviews, which are the two Dutch genres present in the corpus for this task.
All documents were written by language students at the University of Antwerp
between 2012 and 2014. All authors are native speakers of Dutch. The CSI corpus
was developed for use in computational stylometry research (e.g., detection of age,
gender, personality, region of origin), but has many other uses as well (e.g.,
deception detection, sentiment analysis). We adapted the CSI corpus to match the
needs of the authorship verification task and ended up with 200 problem sets for the
review genre and 192 problem sets for the essay genre. All verification problems
include 1-5 known texts. The training and evaluation set each contain half of the
problem sets in each genre.




Table 1. Statistics of the training and evaluation corpora used in the author identification task
                                          at PAN-2014.

                 Language    Genre       #Problems    #Docs    Avg. known docs    Avg. words
                                                                per problem        per document
                  Dutch      Essays          96         268          1.8               412.4
                  Dutch      Reviews        100         202          1.0               112.3
                 English     Essays         200         729          2.6               848.0
     Training    English     Novels         100         200          1.0             3,137.8
                  Greek      Articles       100         385          2.9             1,404.0
                 Spanish     Articles       100         600          5.0             1,135.6
                             Total          696       2,384          2.4             1,091.0
                  Dutch      Essays          96         287          2.0               398.1
                  Dutch      Reviews        100         202          1.0               116.3
                 English     Essays         200         718          2.6               833.2
    Evaluation   English     Novels         200         400          1.0             6,104.0
                  Greek      Articles       100         368          2.7             1,536.6
                 Spanish     Articles       100         600          5.0             1,121.4
                             Total          796       2,575          2.2             1,714.9
     TOTAL                                1,492       4,959          2.3             1,415.0


   The English essays corpus was derived from a previously existing corpus of
English-as-second-language students. The Uppsala Student English (USE) corpus [2]
was originally intended as a tool for research on foreign language learning. It consists
of essays by full-time university students, submitted electronically. In this kind of text,
stylistic awareness is an important writing factor. The USE corpus clearly distinguishes
between essays produced during three different terms: a, b, and c. Every essay is written
in a personal, formal, or academic style. A total of 440 authors contributed at least one
essay to the corpus, resulting in 1,489 documents. The average size of an essay is 820
words. Typically, a student contributed more than one essay, often across different
terms. Taking advantage of the USE corpus meta-information, we defined
two main constraints: every document in the collection, known or questioned, should
contain at least 500 words and the number of known documents in a case must range
between one and five. As a result of the first constraint, only 435 authors were
considered. We also took advantage of the students' background information to set
case-generation rules. Firstly, all the documents in a case must come from the same
term (i.e., all were written within term a, b, or c). Secondly, we divided the students
into age-based clusters. To form negative verification problems,
based on the fact that the students' age ranged between 18 and 59 years, an author A
was considered as candidate match for author Aq according to the following rules:
 -   If Aq is younger than 20 years old, A must be younger than 20 as well;




 -    If Aq is between 20 and 25 years old, A must be exactly the same age;
 -    If Aq is between 26 and 30 years old, A must be in the same age range; and
 -    If Aq is older than 30 years old, A must be older than 30 as well.
   This combination of age- and term-related constraints allowed us to create cases
where the authors come from similar backgrounds. During the generation process, the
original USE texts were slightly modified. Anonymization labels were
substituted by a randomly chosen proper name in English. In order not to provide any
hint about a case, the same name was used both in the questioned and known
documents. One source USE document could be considered at most twice in the
authorship verification corpus: once in a positive case and once in a negative case.
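   As an illustration of these constraints, the age-matching rule can be expressed as a
simple predicate. The following sketch reflects our reading of the rules above; the actual
case-generation scripts are not reproduced here.

```python
def age_compatible(age_q, age_a):
    """Return True if author A (age_a) may serve as the negative counterpart
    of the questioned author Aq (age_q) under the age-based clustering rules."""
    if age_q < 20:
        return age_a < 20
    if 20 <= age_q <= 25:
        return age_a == age_q            # exactly the same age required
    if 26 <= age_q <= 30:
        return 26 <= age_a <= 30         # same age band
    return age_a > 30                    # both older than 30

def candidate_match(term_q, age_q, term_a, age_a):
    # Both constraints must hold: same term (a, b, or c) and compatible ages.
    return term_q == term_a and age_compatible(age_q, age_a)
```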
   The English novels used in the PAN-2014 corpus represent an attempt to provide a
narrower focus in terms of both content and writing style than many similar
collections. Instead of simply focusing on a single genre or time period, they focus on
a very small subgenre of speculative and horror fiction known generally as the
“Cthulhu Mythos”. This is a shared-universe genre, based originally on the writings of
the American author H.P. Lovecraft (for this reason, the genre is also called
“Lovecraftian horror”), with a recurring theme of human ineffectiveness in the face of a
set of powerful named “cosmic horrors”. It is also typically characterized
by extremely florid prose and an unusual vocabulary. Perhaps most significantly,
many of the elements of this genre are themselves unusual terms (e.g.,
unpronounceable proper names of these cosmic horrors such as “Cthulhu”,
“Nyarlathotep”, “Lloigor”, “Tsathoggua”, or “Shub-Niggurath”), thus creating a
strong shared element that is unusual in regular English prose. Similarly, the overall
theme and tone of these stories is strongly negative (many of them, for example, take
the form of classical tragedies and end with the death of the protagonist). For this
reason, we feel that this testbed provides a number of unusual elements that may be
appropriately explored as an example of a tightly controlled genre. The corpus covers
an extended length of time, from Lovecraft's original work to modern fan-fiction.
Documents were gathered from a variety of on-line sources including Project
Gutenberg1 and FanFiction2, and edited for uniformity of format; in some cases
lengthy works were broken down into subsections based on internal divisions such as
chapters or sections.
   The Greek corpus comprises newspaper opinion articles published in the Greek
weekly newspaper TO BHMA3 from 1996 to 2012. Note that the training corpus in
Greek was formed based on the respective training and evaluation corpora of PAN-
2013. The length of each article is at least 1,000 words, while the number of known
texts per problem varies between 1 and 5. In each verification problem, we included
texts that had strong thematic similarities indicated by the occurrence of certain
keywords. In contrast to PAN-2013, there was no stylistic analysis of the texts to
indicate authors with very similar styles or texts of the same author with notable
differences.


1 http://www.gutenberg.org/
2 https://www.fanfiction.net/
3 http://www.tovima.gr




   The Spanish corpus covers the same genre as the Greek corpus. Newspaper
opinion articles from the Spanish newspaper El Pais4 were considered, and author
verification problems were formed taking into account thematic similarities between
articles, as indicated by certain keywords used to index the articles on the website of
this newspaper. All verification problems for this corpus include exactly five known
texts, while the average text length is relatively large, exceeding 1,000 words.


3.2    Performance measures

The probability scores provided by the participants are used to build ROC curves and
the area under the curve (AUC) is used as a scalar evaluation measure. This is a well-
known evaluation technique for binary classifiers [6]. In addition, the performance
measures used in this task should be able to take unanswered problems into account.
As in other tasks, such as question answering, it is preferable to leave a problem
unanswered rather than respond incorrectly when there is great uncertainty.
measures of recall and precision used at PAN-2013 were not able to reward
submissions that left problems unanswered while maintaining high accuracy in given
answers.
   In the current evaluation setup we adopted the c@1 measure, originally proposed
for question answering tasks, which explicitly extends accuracy based on the number
of problems left unanswered [27]. More specifically, to use this measure we first
transform probability scores to binary answers. Every score greater than 0.5 is
considered as a positive answer (i.e., the known and questioned documents are by the
same author), every score lower than 0.5 is considered as a negative answer (i.e., the
known and questioned documents are by different authors) while all scores equal to
0.5 correspond to unanswered problems. Then, c@1 is defined as follows:

                          c@1 = (1/n) · (nc + nu · nc / n)

where n is the number of problems, nc is the number of correct answers, and nu is the
number of problems left unanswered. If a participant provides an answer different from
0.5 for all problems, then c@1 is equal to accuracy. If all problems are left
unanswered, then c@1 is zero. If only some problems are left unanswered, the measure
increases as if these problems had been answered with the same accuracy as the rest of
the problems. Therefore, this measure rewards participants that maintain a high number
of correct answers for cases where they are confident and reduce the number of
incorrect answers by leaving uncertain cases unanswered.
   To provide a final ranking of participants, AUC and c@1 are combined into a final
score, which is simply the product of the two measures. In addition, the efficiency of
the submitted methods is measured in terms of elapsed runtime.
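   For concreteness, the sketch below shows how these measures can be computed from the
submitted probability scores, following the definitions given above (scores above 0.5 are
positive answers, below 0.5 negative, and exactly 0.5 unanswered). It assumes scikit-learn
for the AUC computation; the example data are invented.

```python
# Evaluation sketch for PAN-2014 author verification: AUC, c@1, and their product.
from sklearn.metrics import roc_auc_score

def c_at_1(scores, truth):
    """c@1 = (nc + nu * nc / n) / n, where answers are derived from the scores
    (> 0.5 positive, < 0.5 negative, == 0.5 unanswered)."""
    n = len(scores)
    n_u = sum(1 for s in scores if s == 0.5)
    n_c = sum(1 for s, t in zip(scores, truth)
              if s != 0.5 and (s > 0.5) == bool(t))
    return (n_c + n_u * n_c / n) / n

def final_score(scores, truth):
    auc = roc_auc_score(truth, scores)   # ranking quality of the probability scores
    return auc * c_at_1(scores, truth)   # final ranking criterion of the task

# Invented example: four problems, 1 = same author, 0 = different author.
truth = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.5, 0.4]            # the third problem is left unanswered
print(c_at_1(scores, truth), final_score(scores, truth))
```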




4 http://elpais.com




3.4    Baseline

The author verification task has a random guess baseline of 0.5 for both AUC and
c@1. However, this baseline is not challenging. What we need is a baseline that
corresponds to a standard method so that we know which submissions are really better
than the state of the art. Moreover, since the evaluation corpus comprises several
languages and genres, we need a baseline that can reflect and adapt to the difficulty of
a specific corpus.
   Based on the submissions of the author identification task at PAN-2013, it is
possible to use state-of-the-art methods (in particular, the PAN-2013 winners) and
apply them to the PAN-2014 corpus. However, since the PAN-2014 task comprises more
languages, we need a language-independent approach. In addition, we need a method
that can provide both binary answers and probability scores (the latter was optional at
PAN-2013). Based on these requirements, we selected the approach of [11] to serve
as baseline. More specifically, this approach has the following characteristics:
 -    It is language-independent.
 -    It can provide both binary answers and real scores.
 -    The real scores are already calibrated to probability-like scores for a positive
      answer (i.e., all scores greater than 0.5 correspond to a positive answer).
 -    It was the winner of PAN-2013 in terms of overall AUC scores.
   It should be noted that this baseline method has not been specifically trained on the
corpora of PAN-2014, so its performance is not optimized. It can only be viewed as a
general method that can be applied to any corpus. Moreover, this approach does not
leave problems unanswered, so it cannot take advantage of the new performance
measures.


3.5    Meta-classifier

Following the practice of PAN-2013, we examine the performance of a meta-model
that combines all answers given by the participants for each problem. We define a
straightforward meta-classifier that calculates the average of the probability scores
provided by the participants for each problem. It can be seen as a heterogeneous
ensemble model that combines base classifiers corresponding to different approaches.
Note that the average of all the provided answers is not likely to be exactly 0.5; hence,
this meta-model very rarely leaves problems unanswered. The meta-model can be
naturally extended by mapping all averaged scores between 0.5-a and 0.5+a to exactly
0.5. However, since the parameter a would have to be set to an arbitrary predefined
value or optimized for each language/genre, we decided not to pursue such an
extension.
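   A sketch of this averaging ensemble, including the optional abstention band around 0.5
that we decided not to use, is shown below; the parameter a and the example scores are
illustrative only.

```python
import numpy as np

def meta_classifier(score_matrix, a=0.0):
    """score_matrix: rows = participants, columns = verification problems,
    entries = probability scores in [0,1]. Returns the averaged meta-scores.
    With a > 0, averages within (0.5 - a, 0.5 + a) are mapped to exactly 0.5,
    i.e., the corresponding problems are left unanswered."""
    avg = np.asarray(score_matrix, dtype=float).mean(axis=0)
    if a > 0:
        avg[np.abs(avg - 0.5) < a] = 0.5
    return avg

# Three hypothetical participants on four problems; plain averaging (a = 0),
# as used for the PAN-2014 meta-classifier.
scores = [[0.9, 0.2, 0.55, 0.5],
          [0.8, 0.4, 0.45, 0.5],
          [0.7, 0.1, 0.52, 0.5]]
print(meta_classifier(scores))
```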




          Table 2. Overall evaluation results of the author identification task at PAN-2014.

                                                                                        Unansw.
    Rank                                  FinalScore    AUC       c@1       Runtime
                                                                                        Problems
             META-CLASSIFIER                 0.566      0.798     0.710                     0
      1      Khonji & Iraqi                  0.490      0.718     0.683    20:59:40         2
      2      Frery et al.                    0.484      0.707     0.684    00:06:42         28
      3      Castillo et al.                 0.461      0.682     0.676    03:59:04         78
      4      Moreau et al.                   0.451      0.703     0.641    01:07:34         50
      5      Mayor et al.                    0.450      0.690     0.651    05:26:17         29
      6      Zamani et al.                   0.426      0.682     0.624    02:37:25         0
      7      Satyam et al.                   0.400      0.631     0.634    02:52:37         7
      8      Modaresi & Gross                0.375      0.610     0.614    00:00:38         0
      9      Jankowska et al.                0.367      0.609     0.602    07:38:18         7
     10      Halvani & Steinebach            0.335      0.595     0.564    00:00:54         3
             BASELINE                        0.325      0.587     0.554    00:21:10         0
     11      Vartapetiance & Gillam          0.308      0.555     0.555    01:07:39         0
     12      Layton                          0.306      0.548     0.559    27:00:01         0
     13      Harvey                          0.304      0.558     0.544    01:06:19        100




4         Evaluation Results

We received 13 submissions from research teams in Australia, Canada (2), France,
Germany (2), India, Iran, Ireland, Mexico (2), United Arab Emirates, and United
Kingdom. The participants submitted and evaluated their author verification software
within the TIRA framework [8]. A separate run for each corpus corresponding to each
language and genre was performed.
   The overall results of the task concerning the performance of the submitted
approaches in the whole evaluation corpus are shown in Table 2. These evaluation
scores are the result of micro-averaging over the set of 796 verification problems. In
other words, each verification problem has the same weight in this analysis, so language
and genre information is not taken into account. As can be seen, the overall
winning method of Khonji and Iraqi [17] achieved the best results in terms of AUC and
was also very effective in terms of c@1. On the other hand, it was one of the least
efficient methods, requiring about 21 hours to process the whole evaluation
corpus. The second-best submission by Frery et al. [7] was much more efficient and
achieved the best c@1 score. In general, most of the submitted methods outperformed
the baseline. It has to be emphasized that the five best-performing participants all left
some problems unanswered. In total, 4 of the 13 participants answered all
problems. Moreover, one participant provided binary answers instead of probability
scores [36] and one participant did not process the Greek corpus [10]. The
meta-classifier, which averages the answers of all 13 participants, performs clearly
better than each individual system, achieving a final score greater than 0.5.




       Table 3. Evaluation results on the evaluation corpus of Dutch essays.

                                                                          Unansw.
                           FinalScore    AUC        c@1       Runtime
                                                                          Problems
META-CLASSIFIER              0.867       0.957     0.906                      0
Mayor et al.                 0.823       0.932     0.883     00:15:05         2
Frery et al.                 0.821       0.906     0.906     00:00:30         0
Khonji & Iraqi               0.770       0.913     0.844     00:58:21         0
Moreau et al.                0.755       0.907     0.832     00:02:09        34
Castillo et al.              0.741       0.861     0.861     00:01:57         2
Jankowska et al.             0.732       0.869     0.842     00:23:26         1
BASELINE                     0.685       0.865     0.792     00:00:52         0
Zamani et al.                0.525       0.741     0.708     00:00:27         0
Vartapetiance & Gillam       0.517       0.719     0.719     00:06:37         0
Satyam et al.                0.489       0.651     0.750     00:01:21         0
Halvani & Steinebach         0.399       0.647     0.617     00:00:06         2
Harvey                       0.396       0.644     0.615     00:02:19         0
Modaresi & Gross             0.378       0.595     0.635     00:00:05         0
Layton                       0.307       0.546     0.563     00:55:07         0


      Table 4. Evaluation results on the evaluation corpus of Dutch reviews.

                                                                          Unansw.
                           FinalScore    AUC        c@1       Runtime
                                                                          Problems
Satyam et al.                0.525       0.757     0.694     00:00:16         2
Khonji & Iraqi               0.479       0.736     0.650     00:12:24         0
META-CLASSIFIER              0.428       0.737     0.580                      0
Moreau et al.                0.375       0.635     0.590     00:01:25         0
Zamani et al.                0.362       0.613     0.590     00:00:11         0
Jankowska et al.             0.357       0.638     0.560     00:06:24         0
Frery et al.                 0.347       0.601     0.578     00:00:09         5
BASELINE                     0.322       0.607     0.530     00:00:12         0
Halvani & Steinebach         0.316       0.575     0.550     00:00:03         0
Mayor et al.                 0.299       0.569     0.525     00:07:01         1
Layton                       0.261       0.503     0.520     00:56:17         0
Vartapetiance & Gillam       0.260       0.510     0.510     00:05:43         0
Castillo et al.              0.247       0.669     0.370     00:01:01        76
Modaresi & Gross             0.247       0.494     0.500     00:00:07         0
Harvey                       0.170       0.354     0.480     00:01:45         0




      Table 5. Evaluation results on the evaluation corpus of English essays.

                                                                          Unansw.
                           FinalScore    AUC        c@1       Runtime
                                                                          Problems
META-CLASSIFIER              0.531       0.781     0.680                      0
Frery et al.                 0.513       0.723     0.710     00:00:54        15
Satyam et al.                0.459       0.699     0.657     00:16:23         2
Moreau et al.                0.372       0.620     0.600     00:28:15         0
Layton                       0.363       0.595     0.610     07:42:45         0
Modaresi & Gross             0.350       0.603     0.580     00:00:07         0
Khonji & Iraqi               0.349       0.599     0.583     09:10:01         1
Halvani & Steinebach         0.338       0.629     0.538     00:00:07         1
Zamani et al.                0.322       0.585     0.550     00:02:03         0
Mayor et al.                 0.318       0.572     0.557     01:01:07        10
Castillo et al.              0.318       0.549     0.580     01:31:53         0
Harvey                       0.312       0.579     0.540     00:10:22         0
BASELINE                     0.288       0.543     0.530     00:03:29         0
Jankowska et al.             0.284       0.518     0.548     01:16:35         5
Vartapetiance & Gillam       0.270       0.520     0.520     00:16:44         0


      Table 6. Evaluation results on the evaluation corpus of English novels.

                                                                          Unansw.
                           FinalScore    AUC        c@1       Runtime
                                                                          Problems
Modaresi & Gross             0.508       0.711     0.715     00:00:07         0
Zamani et al.                0.476       0.733     0.650     02:02:02         0
META-CLASSIFIER              0.472       0.732     0.645                      0
Khonji & Iraqi               0.458       0.750     0.610     02:06:16         0
Mayor et al.                 0.407       0.664     0.614     01:59:47         8
Castillo et al.              0.386       0.628     0.615     02:14:11         0
Satyam et al.                0.380       0.657     0.579     02:14:28         3
Frery et al.                 0.360       0.612     0.588     00:03:11         1
Moreau et al.                0.313       0.597     0.525     00:11:04        12
Halvani & Steinebach         0.293       0.569     0.515     00:00:07         0
Harvey                       0.283       0.540     0.525     00:46:30         0
Layton                       0.260       0.510     0.510     07:27:58         0
Vartapetiance & Gillam       0.245       0.495     0.495     00:13:03         0
Jankowska et al.             0.225       0.491     0.457     02:36:12         1
BASELINE                     0.202       0.453     0.445     00:08:31         0




       Table 7. Evaluation results on the evaluation corpus of Greek articles.

                                                                            Unansw.
                           FinalScore     AUC        c@1       Runtime
                                                                            Problems
Khonji & Iraqi                0.720       0.889     0.810      03:41:48         0
META-CLASSIFIER               0.635       0.836     0.760                       0
Mayor et al.                  0.621       0.826     0.752      00:51:03         3
Moreau et al.                 0.565       0.800     0.707      00:05:54         4
Castillo et al.               0.501       0.686     0.730      00:03:14         0
Jankowska et al.              0.497       0.731     0.680      01:36:00         0
Zamani et al.                 0.470       0.712     0.660      00:15:12         0
BASELINE                      0.452       0.706     0.640      00:03:38         0
Frery et al.                  0.436       0.679     0.642      00:00:58         7
Layton                        0.403       0.661     0.610      04:40:29         0
Halvani & Steinebach          0.367       0.611     0.600      00:00:04         0
Satyam et al.                 0.356       0.593     0.600      00:12:01         0
Modaresi & Gross              0.294       0.544     0.540      00:00:05         0
Vartapetiance & Gillam        0.281       0.530     0.530      00:10:17         0
Harvey                        0.000       0.500     0.000                      100


      Table 8. Evaluation results on the evaluation corpus of Spanish articles.

                                                                            Unansw.
                           FinalScore     AUC        c@1       Runtime
                                                                            Problems
META-CLASSIFIER               0.709       0.898     0.790                       0
Khonji & Iraqi                0.698       0.898     0.778      04:50:49         1
Moreau et al.                 0.634       0.845     0.750      00:18:47         0
Jankowska et al.              0.586       0.803     0.730      01:39:41         0
Frery et al.                  0.581       0.774     0.750      00:01:01         0
Castillo et al.               0.558       0.734     0.760      00:06:48         0
Mayor et al.                  0.539       0.755     0.714      01:12:14         5
Harvey                        0.514       0.790     0.650      00:05:23         0
Zamani et al.                 0.468       0.731     0.640      00:17:30         0
Vartapetiance & Gillam        0.436       0.660     0.660      00:15:15         0
Halvani & Steinebach          0.423       0.661     0.640      00:00:27         0
Modaresi & Gross              0.416       0.640     0.650      00:00:08         0
BASELINE                      0.378       0.713     0.530      00:04:27         0
Layton                        0.299       0.553     0.540      05:17:25         0
Satyam et al.                 0.248       0.443     0.560      00:08:09         0




   Tables 3-8 present the evaluation results on each of the six corpora separately. In
all tables, the best performing submission (excluding the meta-classifier and the
baseline method) is in boldface. In terms of average performance of all submitted
approaches, the corpus of Dutch essays seems to be the easiest, while the corpus of
Dutch reviews seems to be the hardest. The latter can be partially explained by the fact
that the corpus provides only one known document per problem and that it contains
only short texts. Moreover, the availability of multiple relatively long known
documents seems to help the submitted systems achieve a better average
performance on the Greek and Spanish corpora compared to the English corpora of
essays and novels. There is a different winner for each corpus with the exception of
[17] who won on both Greek and Spanish corpora. This might indicate a better tuning
of their approach for newspaper opinion articles rather than essays, reviews or novels.
However, the performance of this submission is notable across all corpora, since it is
usually among the three best performing methods, with the exception of the English
essays, where it is ranked 6th (excluding the meta-classifier).
   The performance of the baseline method varies. In the English and Spanish corpora
it is relatively low. In the Dutch and Greek corpora it is very challenging,
outperforming almost half of the participants. In addition, the meta-classifier is very
effective on all corpora. However, it is outperformed by some individual participants
on three corpora. Another interesting remark is that the problems left unanswered by
most participants are not evenly distributed across the corpora. The majority of the
problems left unanswered by Castillo et al. [4] refer to Dutch reviews (possibly
reflecting the difficulty of this corpus). Similarly, Moreau et al. [25] did not answer
many problems of Dutch essays while most of the unanswered problems of Frery et
al. [7] belong to English essays and Greek articles. On the other hand, Mayor et al.
[23] left at least one problem unanswered in each corpus.
   The ROC curves of the best performing participants on the whole evaluation
corpus are shown in Figure 1. More specifically, the convex hull of all submitted
approaches is shown together with the curves of the participants that form part of it.
The overall winning approach of Khonji and Iraqi [17] and the second-best method of
Frery et al. [7] dominate the convex hull when false positive and false negative errors
have the same cost [6]. At low values of FPR in the ROC space, where the cost of false
positives is considered higher than the cost of false negatives, the approach of Modaresi
and Gross [24] is the best. On the other hand, if false negatives have a larger cost than
false positives, at high values of FPR in the ROC space, the approach of Moreau et al.
[25] is the most effective. Note also that the submission by Castillo et al. [4], ranked in
the 3rd position in the overall results (see Table 2), is not part of the convex hull,
meaning that this approach is always outperformed by another approach regardless of
the cost of false positives and false negatives.
   In addition, Figure 1 depicts the ROC curves of the baseline method and the meta-
classifier. The baseline is clearly less effective than the best participants. It
outperforms only Frery et al. [7] at very low values of FPR. On the other hand, the
meta-classifier clearly outperforms the convex hull of all the submitted methods over
the whole range of the curve. This means that the meta-classifier is more effective than
any individual submission for any given cost of false positives and false negatives.
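   The convex hull referred to above can be computed directly from the pooled ROC points
of the individual systems. The following sketch uses a standard monotone-chain upper hull
and is intended only to illustrate the construction, not to reproduce the exact curves of
Figure 1.

```python
from sklearn.metrics import roc_curve

def _cross(o, a, b):
    # z-component of the cross product (a - o) x (b - o)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(truth, score_lists):
    """Pool the ROC points of several systems (plus the trivial corners) and
    return their upper convex hull: the operating points that are optimal for
    some cost ratio of false positives to false negatives."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for scores in score_lists:
        fpr, tpr, _ = roc_curve(truth, scores)
        points.update(zip(fpr, tpr))
    hull = []
    for p in sorted(points):                   # monotone chain, upper hull
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```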




[Figure 1: ROC curves (TPR vs. FPR) of Frery et al., Khonji & Iraqi, Mayor et al.,
Moreau et al., the baseline, the meta-classifier, and the convex hull of all submissions.]
Fig. 1. ROC graphs of the best performing submissions and their convex hull, the baseline
method, and the meta-classifier.

   We computed statistical significance of performance differences between systems
using approximate randomization testing [26]5. As noted by [39] among others, for
comparing outputs from classifiers, frequently used statistical significance tests such
as paired t-tests make assumptions that do not hold for precision scores and F-scores.
Approximate randomization testing does not make these assumptions and can handle
complicated distributions. We performed a pairwise comparison of the accuracy of all
systems based on this method, and the results are shown in Table 9. The null hypothesis
is that there is no difference between the outputs of two systems. When the resulting
p-value satisfies p < 0.05 we consider the systems to be significantly different
from each other. When p < 0.001 the difference is highly significant, when 0.001 < p
< 0.01 the difference is very significant, and when 0.01 < p < 0.05 the difference is
significant.



5   We used the implementation by Vincent Van Asch available from the CLiPS website
    http://www.clips.uantwerpen.be/scripts/art
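   The following sketch illustrates the paired approximate randomization test; the actual
comparisons reported in Table 9 were run with the CLiPS implementation referenced above,
and the number of shuffles here is an arbitrary choice.

```python
import random

def approximate_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Paired approximate randomization test on per-problem correctness
    (equal-length lists of 0/1). Returns an approximate p-value for the null
    hypothesis that the two systems' outputs are interchangeable."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:       # randomly swap the paired outcomes
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)
```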




Table 9. Pairwise significance tests for the entire evaluation corpus. Significant differences are
     marked with asterisks, *** corresponds to highly significant difference (p < 0.001),
 ** corresponds to very significant difference (0.001 < p < 0.01), * corresponds to significant
    difference (0.01 < p < 0.05), while = means the difference is not significant (p > 0.05).
                          Kho  Fre  Cas  Mor  May  Zam  Sat  Mod  Jan  Hal  BAS  Var  Lay  Har
 META-CLASSIFIER           =    *   ***  ***  ***  ***  ***  ***  ***  ***  ***  ***  ***  ***
 Khonji & Iraqi                 =   **   ***  **   **   **   **   **   ***  ***  ***  ***  ***
 Frery et al.                        =    *    =    =    =    =    *   ***  ***  ***  **   ***
 Castillo et al.                          =    =    =    =    =    =    *   **   **    *   ***
 Moreau et al.                                 =    =    =    =    =    =    *    *    =   ***
 Mayor et al.                                       =    =    =    =   **   **   **   **   ***
 Zamani et al.                                           =    =    =   **   **   **   **   ***
 Satyam et al.                                                =    =   **   **   ***  **   ***
 Modaresi & Gross                                                  =    *   **    *    *   ***
 Jankowska et al.                                                       =   **    =    =   ***
 Halvani & Steinebach                                                        =    =    =   ***
 BASELINE                                                                         =    =    *
 Vartapetiance & Gillam                                                                =   **
 Layton                                                                                     **

 Column abbreviations: Kho = Khonji & Iraqi, Fre = Frery et al., Cas = Castillo et al.,
 Mor = Moreau et al., May = Mayor et al., Zam = Zamani et al., Sat = Satyam et al.,
 Mod = Modaresi & Gross, Jan = Jankowska et al., Hal = Halvani & Steinebach,
 BAS = BASELINE, Var = Vartapetiance & Gillam, Lay = Layton, Har = Harvey.

   Based on this analysis, it is easy to see that there are no significant differences
between systems of neighboring rank. The winning submission [17] is either very
significantly or highly significantly better than the rest of the approaches (with the
exception of the second winner [7]). In addition, the meta-classifier is highly
significantly better than all the participants except for the first two winners.




5    Survey of Submissions

Among the 13 participant approaches, 7 were submitted by teams that had also
participated in the PAN-2013 competition. Some of them attempted to improve the method
proposed in 2013 [9, 12, 21, 36] and others presented new models [4, 23, 25].
   All the submitted approaches can be described according to some basic properties.
First, an author verification method is either intrinsic or extrinsic. For each
verification problem, intrinsic methods analyze only the known texts and the questioned
text of that problem to decide whether they are by the same author or not. They do not
make use of any texts by other authors. The majority of submitted approaches fall into
this category [4, 7, 9, 10, 12, 21, 24, 25, 29, 36]. On
the other hand, extrinsic methods attempt to transform author verification from a one-
class classification task (where the known texts are the positive examples and there
are no negative examples) to a binary classification task (where documents by other
authors play the role of the negative examples). To this end, for each verification
problem, extrinsic methods need additional documents by other authors found in
external resources. The approaches of [17, 23, 40] belong to this category. The winning
submission of PAN-2014 by [17] is a modification of the Impostors method [20],
similar to the winner of PAN-2013 [30], and uses a corpus of external documents for
each language/genre.
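   To make the intrinsic/extrinsic distinction concrete, the sketch below outlines a highly
simplified impostors-style verifier in the spirit of [20]. The character n-gram
representation, cosine similarity, and sampling parameters are arbitrary illustrative
choices, not those of any particular submission.

```python
import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def impostors_score(known_texts, questioned_text, impostor_texts,
                    iterations=100, feature_fraction=0.5, n_impostors=10, seed=0):
    """Fraction of random feature subsets for which the questioned document is
    closer to the known author than to every sampled impostor document."""
    rng = random.Random(seed)
    docs = known_texts + impostor_texts + [questioned_text]
    X = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(docs).toarray()
    known = X[:len(known_texts)].mean(axis=0, keepdims=True)   # author profile
    impostors = X[len(known_texts):-1]
    questioned = X[-1:]
    wins = 0
    for _ in range(iterations):
        cols = rng.sample(range(X.shape[1]), max(1, int(feature_fraction * X.shape[1])))
        rows = rng.sample(range(len(impostors)), min(n_impostors, len(impostors)))
        sim_known = cosine_similarity(questioned[:, cols], known[:, cols])[0, 0]
        sim_impostors = cosine_similarity(questioned[:, cols],
                                          impostors[np.ix_(rows, cols)]).max()
        wins += sim_known > sim_impostors
    return wins / iterations       # probability-like score for "same author"
```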
   Another important characteristic of a verification method is its type of learning.
There are lazy approaches, where the training phase is nearly omitted and all necessary
processing is performed at the time a new verification problem has to be decided.
Most of the submitted approaches follow this idea [4, 9, 10, 12, 17, 21, 23,
29, 36, 40]. On the other hand, eager methods attempt to build a general model based
on the training corpus. For example, [7] builds a decision tree for each corpus, [25]
apply a genetic algorithm to find the characteristics of the verification model for each
corpus, and [24] use fuzzy C-means clustering to extract a general description of each
corpus. Since eager methods perform most of the necessary calculations in the
training phase, they are generally more efficient in terms of runtime.
   With respect to the features used for text representation, the majority of the
participant methods focused on low-level measures. More specifically, most of the
proposed features are either character measures (e.g., punctuation mark counts,
prefix/suffix counts, character n-grams) or lexical measures (e.g., vocabulary
richness measures, sentence/word length counts, stopword frequency, n-grams of
words/stopwords, word skip-grams, etc.). There were a few attempts to incorporate
syntactic features, namely POS tag counts [17, 25, 40], while one approach was
exclusively based on that type of information [10].
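   As an example of the kind of low-level features listed above, the sketch below computes
relative frequencies of frequent character n-grams together with a few simple lexical
statistics. The chosen n, the number of n-grams, and the statistics themselves are
illustrative and do not correspond to any specific submission.

```python
from collections import Counter

def low_level_features(text, n=3, top_k=300):
    """Relative frequencies of the top-k character n-grams plus a few simple
    lexical measures (an illustrative feature set, not any team's)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    features = {"char_" + g: count / total for g, count in grams.most_common(top_k)}
    words = text.split()
    if words:
        features["avg_word_len"] = sum(len(w) for w in words) / len(words)
        features["type_token_ratio"] = len(set(words)) / len(words)   # vocabulary richness
        features["punct_rate"] = sum(text.count(p) for p in ".,;:!?") / len(text)
    return features
```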


6    Discussion

The author identification task at PAN-2014 focused on the author verification
problem. The task definition was practically the same as in PAN-2013. However, this
year we substantially enlarged both training and evaluation corpora and enriched them
to include several languages and genres. In that way, we enabled participants to study




                                           892
how they can adapt and fine-tune their approaches according to a given language and
genre. Another important novelty was the use of different performance measures that
put emphasis on both the appropriate ranking of the provided answers in terms of
confidence (AUC) as well as the ability of the submitted systems to leave some
problems unanswered when there is great uncertainty (c@1). We believe that this
combination of performance measures is more appropriate for author verification, a
cost-sensitive task.
   Similar to PAN-2013, the overall winner was a modification of the Impostors
method [17]. The performance of this approach was notably stable in all six different
corpora despite the fact that it did not leave many problems unanswered. This
demonstrates the great potential of extrinsic verification methods. In addition, the
significantly larger training corpus allowed participants to explore, for the first time,
the use of eager learning methods in the author verification task. Such an approach
may be both effective and efficient, as demonstrated by the overall performance
and runtime of the second overall winner [7].
   We received 13 software submissions, a reduced number in comparison to the 18
submissions at PAN-2013, possibly due to the greater difficulty of the task. Moreover,
this year the evaluation of the submitted systems was performed by the participants
themselves using the TIRA framework [8]. Seven participants from PAN-2013
submitted their approaches again this year. It is remarkable that those teams that
slightly modified their existing approach did not achieve a high performance [9, 12,
21, 36]. On the other hand, the teams that radically changed their approach, including
the ability to leave some problems unanswered, achieved very good results [4, 23,
25].
   Based on the software submissions at PAN-2013, we were able to define a
challenging baseline method that is better than random guessing and can reflect the
difficulty of the examined corpus. In many cases, the baseline method was ranked in
the middle of the participants list, clearly showing the approaches with notable
performance. Given the enhanced set of methods for author verification, collected at
PAN-2013 and PAN-2014, we think that it will be possible to further improve the
quality of the baseline methods in future competitions. Moreover, following the
successful practice of PAN-2013, we examined the performance of a meta-model that
combines all submitted systems in a heterogeneous ensemble. This meta-classifier
was better than each individual submitted method while its ROC curve clearly
outperformed the convex hull of all submitted approaches. This demonstrates the
great potential of heterogeneous models in author verification, a practically
unexplored area.
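   A simple way to build such a heterogeneous ensemble is to average the per-problem probability scores of the individual verifiers, as in the sketch below. This only illustrates the general idea; it is not necessarily the exact fusion scheme used for the meta-classifier reported here, and the function name and data layout are our own.

    def ensemble_scores(system_scores):
        """Fuse several verifiers by averaging their per-problem scores.

        system_scores: dict mapping system name -> {problem id: score in [0, 1]},
                       where higher means "same author".
        """
        problems = set.intersection(*(set(s) for s in system_scores.values()))
        k = len(system_scores)
        return {pid: sum(s[pid] for s in system_scores.values()) / k
                for pid in problems}

    fused = ensemble_scores({
        "sysA": {"EN001": 0.9, "EN002": 0.4},
        "sysB": {"EN001": 0.7, "EN002": 0.2},
        "sysC": {"EN001": 0.8, "EN002": 0.6},
    })
    # fused scores: roughly 0.8 for EN001 and 0.4 for EN002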
   For the first time, we applied statistical significance tests to the results of the
submitted methods in order to highlight the real differences between them. According
to these tests, there is no significant difference between systems ranked in
neighboring positions. However, there are highly significant differences between the
winning approach and the rest of the submissions (with the exception of the second
winner). We believe that such significance tests are absolutely necessary to extract
reliable conclusions, and we are going to adopt them in future evaluation labs.
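   One standard way to test such differences is an approximate randomization test over the per-problem outcomes of two systems, in the spirit of [26, 39]. The sketch below is only an illustration of that technique; we do not claim it reproduces the exact test procedure applied in this evaluation, and all names and defaults are ours.

    import random

    def approx_randomization(correct_a, correct_b, rounds=10000, seed=0):
        """Two-sided approximate randomization test on the difference in the
        number of correctly solved problems of two systems.

        correct_a, correct_b: per-problem correctness (1 = correct, 0 = wrong),
                              aligned on the same problems.
        Returns an estimated p-value for the null hypothesis of no difference.
        """
        rng = random.Random(seed)
        observed = abs(sum(correct_a) - sum(correct_b))
        hits = 0
        for _ in range(rounds):
            diff = 0
            for a, b in zip(correct_a, correct_b):
                if rng.random() < 0.5:  # randomly swap the two systems' outcomes
                    a, b = b, a
                diff += a - b
            if abs(diff) >= observed:
                hits += 1
        return (hits + 1) / (rounds + 1)

    # Example: p = approx_randomization(sysA_correct, sysB_correct);
    # the difference is called significant if p falls below, e.g., 0.05.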
   One of our ambitions in this task was to involve experts from forensic linguistics
so that they could manually (or semi-automatically) analyze the same corpora and
submit their answers. This could serve as another very interesting baseline approach
that would enable the comparison of fully-automated systems with traditional human
expert methods. Unfortunately, this attempt was not successful. So far, we have not
been able to find experts in forensic linguistics willing to participate or to devote the
necessary time to solving a large number of author verification problems under certain
time constraints. We are still working in this direction.
   We believe that the focus of PAN-2013 and PAN-2014 on the author verification
task has produced significant progress in this field concerning the development of
new corpora and new methods as well as the definition of an appropriate evaluation
framework. Clearly, author verification is far from being a solved task, and there are
many variations that can be explored in future evaluation labs, including cross-topic
and cross-genre verification (i.e., where the known and the questioned documents do
not match in terms of topic/genre) and very short text verification (i.e., where the
documents are tweets or SMS messages).


Acknowledgement

This work was partially supported by the WIQ-EI IRSES project (Grant No. 269180)
within the FP7 Marie Curie action and by grant OCI-1032683 from the United States
National Science Foundation. The work of the last author is funded by the Spanish
Ministry of Education and Science (TACARDI project, TIN2012-38523-C02-00).


References

1.   S. Argamon and P. Juola. Overview of the International Authorship
     Identification Competition at PAN-2011. In V. Petras, P. Forner, P.D. Clough
     (eds.) CLEF Notebook Papers/Labs/Workshop, 2011.
2.   M. W. Axelsson. USE – The Uppsala Student English Corpus: An instrument for
     needs analysis. ICAME Journal, 24:155-157, 2000.
3.   L. Cappellato, N. Ferro, M. Halvey, and W. Kraaij (eds.). CLEF 2014 Labs and
     Workshops, Notebook Papers. CEUR Workshop Proceedings (CEUR-WS.org),
     ISSN 1613-0073, 2014.
4.   E. Castillo, O. Cervantes, D. Vilariño, D. Pinto, and S. León. Unsupervised
     Method for the Authorship Identification Task – Notebook for PAN at CLEF
     2014. In Cappellato, et al. [3].
5.   H.J. Escalante, M. Montes-y-Gómez and L. Villaseñor-Pineda. Particle Swarm
     Model Selection for Authorship Verification. In Proceedings of the 14th
     Iberoamerican Conference on Pattern Recognition, pages 563-570, 2009.
6.   T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition Letters,
     27(8):861-874, 2006.
7.   J. Fréry, C. Largeron, and M. Juganaru-Mathieu. UJM at CLEF in Author
     Identification – Notebook for PAN at CLEF 2014. In Cappellato, et al. [3].
8.   T. Gollub, M. Potthast, A. Beyer, M. Busse, F. Rangel, P. Rosso, E. Stamatatos,
     and B. Stein. Recent Trends in Digital Text Forensics and its Evaluation. In P.
     Forner, H. Müller, R. Paredes, P. Rosso, and B. Stein (eds), Information Access
      Evaluation meets Multilinguality, Multimodality, and Visualization. 4th
      International Conference of the CLEF Initiative, 2013.
9.    O. Halvani and M. Steinebach. VEBAV - A Simple, Scalable and Fast
      Authorship Verification Scheme – Notebook for PAN at CLEF 2014. In
      Cappellato, et al. [3].
10.   S. Harvey. Author Verification Using PPM with Parts of Speech Tagging –
      Notebook for PAN at CLEF 2014. In Cappellato, et al. [3].
11.   M. Jankowska, V. Kešelj, and E. Milios. Proximity based One-class
      Classification with Common N-Gram Dissimilarity for Authorship Verification
      Task – Notebook for PAN at CLEF 2013. In P. Forner, R. Navigli, and D. Tufis
      (eds). CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers,
      2013.
12.   M. Jankowska, V. Kešelj, and E. Milios. Ensembles of Proximity-Based One-
      Class Classifiers for Author Verification – Notebook for PAN at CLEF 2014. In
      Cappellato, et al. [3].
13.   P. Juola. Authorship Attribution. Foundations and Trends in IR, 1:234–334,
      2008.
14.   P. Juola. An Overview of the Traditional Authorship Attribution Subtask. In
      Proc. of CLEF’12, 2012.
15.   P. Juola and E. Stamatatos. Overview of the Author Identification Task at PAN-
      2013. In P. Forner, R. Navigli, and D. Tufis (eds). CLEF 2013 Evaluation Labs
      and Workshop – Working Notes Papers, 2013.
16.   M. Kestemont, K. Luyckx, W. Daelemans, and T. Crombez. Cross-Genre
      Authorship Verification Using Unmasking. English Studies, 93(3):340-356,
      2012.
17.   M. Khonji and Y. Iraqi. A Slightly-modified GI-based Author-verifier with Lots
      of Features (ASGALF) – Notebook for PAN at CLEF 2014. In Cappellato, et al.
      [3].
18.   M. Koppel, J. Schler, and S. Argamon. Authorship Attribution in the Wild.
      Language Resources and Evaluation, 45:83–94, 2011.
19.   M. Koppel, J. Schler, and E. Bonchek-Dokow. Measuring Differentiability:
      Unmasking Pseudonymous Authors. Journal of Machine Learning Research,
      8:1261–1276, 2007.
20.   M. Koppel and Y. Winter. Determining if Two Documents are by the Same
      Author. Journal of the American Society for Information Science and
      Technology, 65(1):178-187, 2014.
21.   R. Layton. A Simple Local n-gram Ensemble for Authorship Verification –
      Notebook for PAN at CLEF 2014. In Cappellato, et al. [3].
22.   K. Luyckx and W. Daelemans. Authorship Attribution and Verification with
      Many Authors and Limited Data. In Proceedings of the Twenty-Second
      International Conference on Computational Linguistics (COLING), pages 513-
      520, 2008.
23.   C. Mayor, J. Gutierrez, A. Toledo, R. Martinez, P. Ledesma, G. Fuentes, and I.
      Meza. A Single Author Style Representation for the Author Verification Task –
      Notebook for PAN at CLEF 2014. In Cappellato, et al. [3].
24. P. Modaresi and P. Gross. A Language Independent Author Verifier Using
    Fuzzy C-Means Clustering – Notebook for PAN at CLEF 2014. In Cappellato, et
    al. [3].
25. E. Moreau, A. Jayapal, and C. Vogel. Author Verification: Exploring a Large set
    of Parameters using a Genetic Algorithm – Notebook for PAN at CLEF 2014. In
    Cappellato, et al. [3].
26. E. W. Noreen. Computer Intensive Methods for Testing Hypotheses: An
    Introduction. Wiley, 1989.
27. A. Peñas and A. Rodrigo. A Simple Measure to Assess Nonresponse. In Proc. of
    the 49th Annual Meeting of the Association for Computational Linguistics, Vol.
    1, pages 1415-1424, 2011.
28. F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, and G. Inches. Overview of the
    Author Profiling Task at PAN 2013. In P. Forner, R. Navigli, and D. Tufis
    (eds.), Working Notes Papers of the CLEF 2013 Evaluation Labs, 2013.
29. Satyam, Anand, A. K. Dawn, and S. K. Saha. A Statistical Analysis Approach to
    Author Identification Using Latent Semantic Analysis – Notebook for PAN at
    CLEF 2014. In Cappellato, et al. [3].
30. S. Seidman. Authorship Verification Using the Impostors Method – Notebook
     for PAN at CLEF 2013. In P. Forner, R. Navigli, and D. Tufis (eds). CLEF 2013
     Evaluation Labs and Workshop – Working Notes Papers, 2013.
31. E. Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of
    the American Society for Information Science and Technology, 60:538–556,
    2009.
32. E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Automatic Text Categorization
    in Terms of Genre and Author. Computational Linguistics, 26(4):471-495, 2000.
33. C. Sanderson and S. Guenter. Short Text Authorship Attribution via Sequence
    Kernels, Markov Chains and Author Unmasking: An Investigation. In
    Proceedings of the International Conference on Empirical Methods in Natural
    Language Processing, pages 482–491, 2006.
34. B. Stein, N. Lipka and S. Meyer zu Eissen. Meta Analysis within Authorship
    Verification. In Proceedings of the 19th International Conference on Database
    and Expert Systems Applications, pages 34-39, 2008.
35. B. Stein, N. Lipka, and P. Prettenhofer. Intrinsic Plagiarism Analysis. Language
    Resources and Evaluation, 45:63-82, 2011.
36. A. Vartapetiance and L. Gillam. A Trinity of Trials: Surrey’s 2014 Attempts at
    Author Verification – Notebook for PAN at CLEF 2014. In Cappellato, et al.
    [3].
37. H. van Halteren. Linguistic Profiling for Author Recognition and Verification.
    In Proceedings of the 42nd Annual Meeting on Association for Computational
    Linguistics, 2004.
38. B. Verhoeven and W. Daelemans. CLiPS Stylometry Investigation (CSI)
    Corpus: A Dutch Corpus for the Detection of Age, Gender, Personality,
    Sentiment and Deception in Text. In Proc. of the 9th Int. Conf. on Language
    Resources and Evaluation (LREC), 2014.
39. A. Yeh. More Accurate Tests for the Statistical Significance of Result
    Differences. In Proceedings of the 18th Conference on Computational
    Linguistics, Volume 2, pages 947-953, 2000.
40. H. Zamani, H. Nasr, P. Babaie, S. Abnar, M. Dehghani, and A. Shakery.
    Authorship Identification Using Dynamic Selection of Features from
    Probabilistic Feature Set. In Proc. of the 5th International Conference of the
    CLEF Initiative, 2014.