EPSMS and the Document Occurrence Representation for Authorship Identification*
Notebook for PAN at CLEF 2011

Hugo Jair Escalante
Graduate Program in Systems Engineering, Universidad Autónoma de Nuevo León, San Nicolás de los Garza, NL 66450, México
hugo.jair@gmail.com
http://hugojair.org

Abstract. This paper describes the participation of the PISIS team in the authorship identification track of PAN'11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mixed: in the small data sets distributional representations proved very helpful, although in the large data sets a simple bag-of-words approach outperformed the document occurrence approach. For authorship verification we adopted a classification-based approach and proposed a modification to Ensemble Particle Swarm Model Selection (EPSMS) for selecting classification models for each task. This approach obtained acceptable performance in two out of the three data sets.

* The author thanks the efforts of the organizers of PAN'11 for providing a valuable forum for the evaluation of authorship identification methods. The author also thanks the reviewers for their helpful comments.

1 Introduction

Authorship attribution (AA) and authorship verification (AV) are two closely related problems that aim at uncovering the writing style of authors [27]. Applications of AA and AV include spam filtering [30], fraud detection [14], computer forensics [18], cyber bullying [23] and plagiarism detection [25]. Because of their wide applicability, mainly in security settings, the development of automated AA techniques has received much attention recently [16,25,27].

AA is defined as the task of identifying who, from a set of candidates, is the author of a given document [27], while AV is the task of deciding whether given text documents were or were not written by a certain author [17]. Effective methods have been proposed for both tasks so far; see, for example, the methods evaluated and/or reviewed in [28,15,16,25,27]. One of the most popular formulations for AA and AV is the one based on supervised machine learning, where both problems are cast as classification tasks. More specifically, AA can be cast as a multiclass classification problem, with as many labels as candidate authors [10,13]. AV, on the other hand, can be cast as a binary classification problem [9,14].

This paper describes the approach adopted by the PISIS team (http://pisis.fime.uanl.mx/) for the Authorship Identification track of the PAN Lab (http://www.webis.de/research/workshopseries/pan-11/) at CLEF 2011; see [25] for more information on the PAN competition and workshop series. We adopted classification-based methods for both the AA and AV tasks. For AA we used standard classification algorithms with a distributional term representation for documents. Intuitively, we want to model the writing style of authors in terms of their association with other documents, as captured by the document occurrence representation. Experimental results in the PAN'11 Authorship Attribution track show that the proposed approach was very helpful for the small data set. Results in the large data set are very competitive as well, although we found that a simple bag-of-words representation and a nonlinear classifier outperformed the distributional representations.
For AV we used a method called Ensemble Particle Swarm Model Selection (EPSMS) [6] for building ad-hoc classifiers for each AV task. We used sample documents written by the author as positive examples and documents written by other authors as negative examples. In order to obtain stable predictions we adopted a meta-ensemble approach that combines the outputs of several runs of the model selection technique. Documents are ranked by the probability that they were written by the author of interest. Experimental results in the PAN'11 Authorship Attribution track show that the proposed approach was effective for 2 out of 3 data sets, although there are several aspects of the proposed methodology that can still be improved.

The rest of this working note describes in detail the methodologies adopted for the AA and AV tasks, reports the results obtained with them and summarizes our main findings. Before describing the proposed methodology we briefly review related work on AA and AV in the next section.

2 Related work

In the classification-based approach to AA and AV, sample documents written by each author are considered instances of a standard classification problem [16,27]. Learning algorithms that have been used for AA and AV include support vector machines (SVMs) [10,17,13] and variants thereof [24], neural networks [29], Bayesian classifiers [2], decision tree methods [16] and similarity-based techniques [18,16], among several others. In the above works the same learning algorithm has been used for building the classification models of all of the authors under consideration. An exception is the work by Escalante et al. [9], where particle swarm model selection (PSMS) was used for building specific classification models for each author. The hypothesis of that work is that considering specific methods for preprocessing, feature selection and classification for each author will increase classification performance. Satisfactory performance was obtained in the task of AV (i.e., binary classification), although AA performance (i.e., multiclass classification) was limited, because of the incompatibility of scales for the outputs of different models (see [8]). Since PSMS has proved to be very effective for diverse binary classification tasks [5,6,9], in this paper we adopt a modified PSMS for the AV task and use standard learning algorithms for the AA task of the PAN'11 Authorship Identification track.

While standard learning algorithms have been used for AA and AV, a wide diversity of features has been used for representing documents, including character, lexical, syntactic, grammatical and semantic features, among others [12,16,27]. Nevertheless, the most used representation is still the one based on the bag-of-words formulation. In particular, the bag-of-words formulation using character n-grams as terms has been successfully used by several researchers [9,10,13,22]. In this paper we adopted an extended bag-of-words representation for documents called the document occurrence representation (DOR) [19]. Under DOR, documents are represented by a distribution of occurrences over other documents in the corpus; in this way documents are represented by their context. DOR has been successfully used in term clustering [21], word sense disambiguation [11] and multimedia image retrieval [7].

3 Authorship verification

Three AV tasks were evaluated in the PAN'11 authorship identification track.
For each task the organizers provided sample documents written by the author (training set) and documents written by the author and other authors (validation set). The developed method was tested on documents from the test set (of course, labels in the test set were not available to participants during the competition). Since both training and validation data were available during development, we merged the documents in the training and validation sets for training our method. Table 1 shows, for each data set, the number of documents written and not written by the author in the training, validation and test sets.

Table 1. Number of documents written (Y) and not written (N) by the author in the training, validation and test sets.

Data set     Training (Y)   Validation (Y)   Validation (N)   Test (Y)   Test (N)
Verify-1+         42              3               104              3         92
Verify-2+         55              3                95              5        101
Verify-3+         47              3               100              4         89

3.1 Features

In our approach to AV we used the documents in the training and validation sets as training data for a classifier that discriminates between documents written by the author and documents written by any other author. Documents were represented by their bag-of-words using character n-grams as terms, with n = 3. Spaces and punctuation marks were considered characters. We did not use the distributional term representation for this task because of the small number of documents in the training and validation sets, see Section 4.
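To make the feature extraction concrete, the following is a minimal sketch of a character 3-gram bag-of-words in which spaces and punctuation count as characters. The function names and the use of raw term counts are illustrative assumptions; this is not the code used in the competition.

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Character n-grams over the raw text; spaces and punctuation are kept.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_words(docs, n=3):
    # Build a document-term count matrix using character n-grams as terms.
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted({t for c in counts for t in c})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, c in enumerate(counts):
        for term, freq in c.items():
            matrix[i][index[term]] = freq
    return vocab, matrix

# Toy usage with two short "documents"
vocab, X = bag_of_words(["the cat sat.", "the dog sat."])
```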
3.2 Classification approach

Once documents are represented by their bag-of-words, we used Ensemble Particle Swarm Model Selection (EPSMS) for the selection of classification models for each data set. EPSMS is a method for the automatic selection of binary classification models [6]. In a nutshell, EPSMS searches for the best ensemble that can be generated by using the methods available in a machine learning toolbox (http://clopinet.com/CLOP). An ensemble is a classification model that combines the outputs of several classifiers; under certain conditions, it has been shown that ensembles can achieve better performance than individual models [3]. In previous work we have shown that EPSMS can select very effective ensemble classification models [6,4]. A distinctive feature of EPSMS is that each member of the ensemble is a method that differs in terms of preprocessing method, feature selection technique and learning algorithm. The heterogeneity of the considered models (diversity), together with their competitive accuracy (performance), favors the selection of very effective classification models. See [6,4,5] for further details on EPSMS.

For each AV data set we provide the training+validation data as input to EPSMS, and EPSMS returns an ensemble classifier. Although EPSMS provides very stable classification models [6,4], in this work we wanted to obtain even more stable models. Therefore, we adopted a meta-ensemble approach in which the outputs of several ensembles (each one selected with EPSMS) were combined. The intuition behind this technique is that by running EPSMS several times and combining the outputs of the corresponding methods we could obtain more stable predictions. Stability is very important for EPSMS because the method is based on a heuristic search and the search space contains many local minima.

The meta-ensemble approach is as follows. For each AV data set we ran EPSMS 5 times. Then the selected ensembles were applied to the test data set. As a result we have, for each test document, the five outputs provided by the 5 ensembles. The output of each ensemble is a real number in [0, 1] expressing the confidence that the sample belongs to the positive class. The outputs of each ensemble are sorted in descending order, in such a way that test documents that are more likely to belong to the positive class (i.e., documents written by the author) are ranked in the first positions. For each ranking we keep the top-10 ranked documents. Then, for each document in the union of the five top-10 lists (at most 50 documents) we count the number of rankings in which it appeared within the top-10 positions (a number between 1 and 5). Finally, we sort the test documents by this number and assign the positive label to the top-10 ranked documents. Our hypothesis with this method is that if a document was written by the author, it is very likely to receive a high score from several EPSMS ensembles. Documents not written by the author of interest may appear in the top ranked documents for one or two ensembles, although it is reasonable to assume that the top ranked documents are those with more chances of having been written by the author. The choice of the top-10 cutoff was made by analyzing the outputs of the different ensembles: we found that beyond 10 documents most of the documents received very similar scores in the test set.
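The combination step can be summarized with the following sketch. It assumes each of the five ensembles returns a dictionary of confidences in [0, 1] for the test documents; the function name is illustrative, and tie-breaking among documents with the same number of votes is left unspecified, as in the description above.

```python
from collections import Counter

def combine_rankings(scores_per_ensemble, top_k=10):
    # scores_per_ensemble: one dict {doc_id: confidence in [0, 1]} per EPSMS ensemble.
    votes = Counter()
    for scores in scores_per_ensemble:
        # Top-k documents of this ensemble, by descending confidence.
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
        votes.update(ranked)  # one vote per appearance in a top-k list
    # Documents sorted by number of top-k appearances (between 1 and the number of ensembles).
    final_ranking = sorted(votes, key=votes.get, reverse=True)
    # The first top_k documents receive the positive label.
    return final_ranking[:top_k]
```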
4 Authorship attribution

As mentioned in the related work section, the performance of PSMS [5] and EPSMS [8,4] for multiclass classification models is not as good as for binary classification tasks. Therefore, we decided to adopt a different approach for the AA task. In particular, we focused on the evaluation of an extended bag-of-words representation for documents and used a standard classification model. Table 2 summarizes the main statistics of the AA data sets for the PAN'11 Authorship Identification track.

Table 2. Description of the AA data sets. Standard data sets are those not including additional authors unavailable during training, while plus data sets (+) may include documents written by authors that were not represented in the training set. For each data set and each partition we show the number of documents and, between parentheses, the number of classes.

Data set    Training     Validation    Test
Small       3001 (26)     518 (23)      495 (23)
Small+      3001 (26)     601 (43)      634 (45)
Large       9337 (72)    1298 (66)     1300 (64)
Large+      9337 (72)     144 (86)     1416 (87)

4.1 Features

The bag-of-words representation using character n-grams as terms is among the most used representations for documents in AA [9,10,13,16,22,27]. Despite the acceptable performance obtained with this representation in AA, we think that the results can be improved by adopting extended representations. Several extensions to the bag-of-words approach have been proposed in closely related fields such as information retrieval [1], computational linguistics [19] and machine learning [20]. In this work we explore the suitability of the document occurrence representation (DOR) for document representation in AA.

DOR is a distributional term representation in which a document is represented by a distribution of occurrences over other documents in the same corpus [19]. Intuitively, a document is represented by its context. The process for obtaining the DOR representation of documents is as follows. First, each term in the vocabulary is represented as a distribution of occurrences over documents. Next, each document is represented by a combination of the representations of the terms that occur in it. DOR can be considered the dual of the tf-idf representation: as documents can be represented by a distribution over terms, terms can be represented by a distribution over documents. Each term $t_j$ in the vocabulary $V$ is represented by a vector of weights $\mathbf{w}^{dor}_j = \langle w^{dor}_{j,1}, \ldots, w^{dor}_{j,N} \rangle$, where $N$ is the number of documents in the collection and $0 \leq w^{dor}_{j,k} \leq 1$ represents the contribution of document $d_k$ to the representation of $t_j$. Specifically, we consider the following weighting scheme [19]:

$$w^{dor}(t_j, d_k) = df(t_j, d_k) \times \log\left(\frac{|V|}{N_k}\right) \qquad (1)$$

where $N_k$ is the number of different terms that appear in document $d_k$ and $df(t_j, d_k)$ is given by:

$$df(t_j, d_k) = \begin{cases} 1 + \log \#(t_j, d_k) & \text{if } \#(t_j, d_k) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\#(t_j, d_k)$ denotes the number of times term $t_j$ occurs in document $d_k$. The weights are normalized using cosine normalization. Intuitively, the more frequently term $t_j$ occurs in document $d_k$, the more important $d_k$ is to characterize the semantics of $t_j$; on the other hand, the more different terms occur in $d_k$, the less it contributes to characterize the semantics of $t_j$.

Once each term is represented according to Equation (1), each document is represented by the unweighted sum of the representations of the terms that appear in it. In this way, a document is represented as a distribution of occurrences over other documents in the collection. Our hypothesis on the use of DOR for AA is that the expanded representations are more descriptive than the usual bag-of-words approach. We did not use this representation for the AV task because the number of documents in the different tasks is very small (see Table 1), which would result in very low dimensional representations of documents.
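The following is a minimal sketch of the DOR weighting of Equations (1) and (2), including cosine normalization and the unweighted sum over a document's terms. It assumes each term is summed once regardless of its frequency in the document; the function name and data structures are illustrative, not the implementation used for the experiments.

```python
import math

def dor_representations(docs_terms):
    # docs_terms: one list of terms per document (e.g., character 3-grams).
    # Returns one DOR vector of length N (the number of documents) per document.
    N = len(docs_terms)
    vocab = sorted({t for d in docs_terms for t in d})
    n_distinct = [len(set(d)) for d in docs_terms]                   # N_k: distinct terms in d_k
    counts = [{t: d.count(t) for t in set(d)} for d in docs_terms]   # #(t_j, d_k)

    # Equations (1)-(2): contribution of document d_k to the representation of term t_j.
    term_vecs = {}
    for t in vocab:
        vec = []
        for k in range(N):
            occ = counts[k].get(t, 0)
            df = 1.0 + math.log(occ) if occ > 0 else 0.0
            vec.append(df * math.log(len(vocab) / n_distinct[k]))
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0             # cosine normalization
        term_vecs[t] = [w / norm for w in vec]

    # A document is the unweighted sum of the vectors of the terms it contains.
    doc_vecs = []
    for d in docs_terms:
        vec = [0.0] * N
        for t in set(d):
            vec = [a + b for a, b in zip(vec, term_vecs[t])]
        doc_vecs.append(vec)
    return doc_vecs
```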
4.2 Classification approach

For classification we used the neural network classifier implemented in the CLOP toolbox [26]. We selected this classifier after a preliminary evaluation of several classification algorithms, in which we found that the combination of the DOR representation and the neural network classifier achieved the highest performance in the validation data sets. For the standard data sets (see Table 2) we used a straightforward multiclass classifier (one-vs-all approach), where a class corresponds to an author. For the plus data sets (i.e., data sets that contain documents not written by any author in the training set) we used a multiclass classifier with an extra class, unknown author: we simply treated documents not written by any author in the training set as belonging to one additional author. Recall that we used the training+validation data for training the classifiers.

5 Evaluation

This section reports the results obtained with the proposed methods in the authorship attribution track of PAN'11. We first analyze the performance of the AV methods and then that of the AA techniques.

5.1 Authorship verification

Table 3 shows the results obtained in the AV data sets. The results are mixed: our EPSMS approach obtained the first position in the second data set, although it was ranked ninth in the third data set. For the Verify-1 data set, a single document written by the author (out of three available) was identified; this document was ranked second according to the weights generated with the meta-ensemble approach. The other two relevant documents did not appear in the top ranked documents for any of the 5 ensembles. For the Verify-2 data set, 4 out of 5 documents were identified by the EPSMS approach, while no document written by the author was correctly identified for the Verify-3 data set.

Table 3. Experimental results in the AV task. We show Precision, Recall and F1 measure.

Data set    Precision   Recall   F1      Sum-Ranks   Overall Rank
Verify-1      0.1       0.333    0.154      17       6th out of 10
Verify-2      0.4       0.8      0.533      11       1st out of 10
Verify-3      0         0        0          30       9th out of 10

From Table 1 we can see that the problems are imbalanced, and the fact that negative examples (documents written by other authors) come from several different authors further complicates the classification problem. Nevertheless, the results obtained with the proposed formulation are interesting and give evidence that the classification approach to AV can be very effective. We believe the proposed method has potential for this and other binary classification tasks, although we would like to conduct an extensive evaluation in order to detect what factors influence the performance of the proposed technique. A limitation of the proposed approach is that it only ranks documents by how likely they are to have been written by the author, so a threshold (the top-10 ranked documents) must be used for determining which documents were written by the author. In future work we would like to study alternative formulations for the combination of the outputs from different ensembles.

5.2 Authorship attribution

Table 4 shows the official results obtained by our methods in the AA task. Overall, the results were very competitive: our entries were above the average performance of the other participants. The results were particularly positive in the Small data sets, where our method ranked second and third. Interestingly, the DOR representation proved more helpful for the data sets that included authors not seen in the training set, giving evidence that a classification approach to modeling unknown authors can be an effective solution for this AA scenario. The performance in terms of macro and micro average measures was consistent.

Table 4. Experimental results in the AA task. We show Macro Average (MA) and Micro Average (MI) Precision (P), Recall (R) and F1 measure (F1). Column Sum-Ranks shows the sum of ranks across the different measures; we also show the overall ranking achieved by each entry.

Data set   MA-P    MA-R    MA-F1   MI-P    MI-R    MI-F1   Sum-Ranks   Overall Rank
Small      0.676   0.381   0.387   0.709   0.709   0.709      19       3rd out of 17
Small+     0.65    0.201   0.193   0.578   0.573   0.575      16       2nd out of 13
Large      0.608   0.294   0.303   0.508   0.508   0.508      48       8th out of 18
Large+     0.53    0.203   0.191   0.446   0.446   0.446      29       5th out of 13
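For reference, macro averaging computes a measure per class and averages over classes, while micro averaging pools the decisions of all classes; for single-label multiclass problems in which every document receives exactly one prediction, the micro-averaged precision, recall and F1 all equal accuracy. The following is a minimal sketch of both averages (it is not the official PAN'11 evaluation script).

```python
from collections import Counter

def macro_micro_f1(y_true, y_pred):
    # Macro-F1 averages per-class F1 scores; micro-F1 pools counts over all classes.
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1_per_class = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1_per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro_f1 = sum(f1_per_class) / len(classes)
    # Pooled counts: with one prediction per document this equals accuracy.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return macro_f1, micro_f1
```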
In order to evaluate the advantage of the DOR representation over a standard bag-of-words formulation we performed post-competition experiments (participants were provided with the labels of the test set documents after the competition finished). We used the same classification-based approach described in Section 4, although with a binary bag-of-words representation using character n-grams as terms, and the same neural network with the same (default) parameters as used with the DOR representation. Table 5 shows the performance obtained by the classifier with both representations.

Table 5. Experimental results in the AA task using the DOR and bag-of-words (BOW) representations. We show accuracy, Macro Average and Micro Average F1 measure, and the rank that would be obtained by each representation using only the F1 measure values. For the plus data sets we were unable to reproduce the performance measurements provided by the organizers, so we do not show the computed results for those data sets.

           Accuracy          Macro-F1        Micro-F1        Sum-ranks
Data set   DOR     BOW       DOR     BOW     DOR     BOW     DOR   BOW
Small      70.91   67.88     0.387   0.418   0.709   0.678    3     4
Small+     57.25   55.20     0.709   0.552   0.193    -       2     2
Large      50.76   62.53     0.303   0.463   0.507   0.6254   8     3
Large+     44.56   53.24     0.446   0.532   0.191    -       5     2

Results are mixed: for the Small data sets the DOR representation outperformed the bag-of-words formulation; while the improvement in terms of accuracy is considerable, the ranking of both methods was not significantly affected. On the contrary, in the Large data sets the bag-of-words approach outperformed the DOR representation. The differences in all of the measures are considerable (more than 10% in accuracy), and the ranks for the Large data sets are considerably reduced (i.e., improved) for the bag-of-words approach. This result was somewhat unexpected: one may think that, since more documents are available in the Large data sets, the DOR representations would be more informative (richer). We think these results are due to the fact that with more classes there can be overlap between the representations of documents that belong to different classes. We will try to clarify this behavior in future work. Another issue could be that the number of documents over which the DOR representation is computed (and even the selection of which documents are used) can have an important impact on the performance of methods based on this representation.

6 Conclusions

We described the methods adopted for the PAN'11 Authorship Identification track. Different methods were proposed for the attribution (AA) and verification (AV) tasks. For AV we used EPSMS, a tool for the automated selection of ensemble classifiers. Our results show that EPSMS is a very competitive method, although it can still be further improved. In particular, we would like to study different ways to determine whether a document has or has not been written by an author from the outputs of several ensembles selected with EPSMS. For AA we adopted the document occurrence representation and used a standard classifier. We found that in the Small data sets the DOR representation proved very helpful, although this was not the case for the Large data sets. It is interesting, and somewhat disappointing, that a simple bag-of-words representation outperformed the DOR-based approach in the Large data sets. We would like to analyze in more detail the benefits of DOR for AA and what factors affect the performance of methods based on that representation.

References

1. Carrillo, M., Eliasmith, C., Lopez-Lopez, A.: Combining text vector representations for information retrieval. In: Proc. of the 12th International Conference on Text, Speech and Dialogue (TSD). LNCS, vol. 5729, pp. 24–31. Springer (2009)
2. Coyotl-Morales, R.M., Villaseñor-Pineda, L., Montes-y-Gómez, M., Rosso, P.: Authorship attribution using word sequences. In: Proc. of the 11th Iberoamerican Congress on Pattern Recognition. LNCS, vol. 4225, pp. 844–852. Springer, Cancun, Mexico (2006)
3. Dietterich, T.: Ensemble methods in machine learning. In: Proc. of the First Workshop on Multiple Classifier Systems. LNCS, vol. 1857, pp. 1–15. Springer (2000)
4. Escalante, H.J., Altamirano, L., Gonzalez, J.A., Montes, M., Gomez, P., Reta, C., Reyes, C.A., Rosales, A.: Acute leukemia classification with ensemble particle swarm model selection. Artificial Intelligence in Medicine, submitted (2011)
5. Escalante, H.J., Montes, M., Sucar, E.: Particle swarm model selection. Journal of Machine Learning Research 10(Feb), 405–440 (2009)
6. Escalante, H.J., Montes, M., Sucar, E.: Ensemble particle swarm model selection. In: Proc. of the World Congress on Computational Intelligence. pp. 1814–1821. IEEE, Barcelona, Spain (2010)
7. Escalante, H.J., Montes, M., Sucar, E.: Multimodal indexing based on semantic cohesion for image retrieval. Information Retrieval, in press (2011)
8. Escalante, H.J., Montes, M., Sucar, L.E.: An energy-based model for image annotation and retrieval. Computer Vision and Image Understanding 115(6), 787–803 (2011)
9. Escalante, H.J., Montes, M., Villaseñor, L.: Particle swarm model selection for authorship verification. In: Proc. of the 14th Iberoamerican Congress on Pattern Recognition. LNCS, vol. 5856, pp. 563–570. Springer, Guadalajara, Mexico (2009)
10. Escalante, H.J., Solorio, T., Montes, M.: Local histograms of character n-grams for authorship attribution. In: Proc. of the 49th Annual Meeting of the Association for Computational Linguistics. pp. 288–298. ACL (2011)
11. Gale, W.A., Church, K.W., Yarowsky, D.: A method for disambiguating word senses in a large corpus. Computers and the Humanities 26(5), 415–439 (1993)
12. Grieve, J.: Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing 22(3), 251–270 (2007)
13. Houvardas, J., Stamatatos, E.: N-gram feature selection for author identification. In: Proc. of the 12th International Conference on Artificial Intelligence: Methodology, Systems, and Applications. LNCS, vol. 4183, pp. 77–86. Springer, Varna, Bulgaria (2006)
14. Iqbal, F., Khan, L.A., Fung, B.C.M., Debbabi, M.: E-mail authorship verification for forensic investigation. In: Proc. of the 2010 ACM Symposium on Applied Computing (SAC '10). pp. 1591–1598. ACM, New York, NY, USA (2010), http://doi.acm.org/10.1145/1774088.1774428
15. Juola, P.: Authorship attribution. Foundations and Trends in Information Retrieval 1(3), 233–334 (2006)
16. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology 60, 9–26 (2009)
17. Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proc. of the Twenty-First International Conference on Machine Learning (ICML '04). p. 62. ACM, New York, NY, USA (2004), http://doi.acm.org/10.1145/1015330.1015448
18. Lambers, M., Veenman, C.J.: Forensic authorship attribution using compression distances to prototypes. In: Computational Forensics. LNCS, vol. 5718, pp. 13–24. Springer (2009)
19. Lavelli, A., Sebastiani, F., Zanoli, R.: Distributional term representations: An experimental comparison. In: Proc. of the International Conference on Information and Knowledge Management. pp. 615–624. ACM Press (2005)
20. Lebanon, G., Mao, Y., Dillon, J.: The locally weighted bag of words framework for document representation. Journal of Machine Learning Research 8, 2405–2441 (2007)
21. Lewis, D.D., Croft, W.B.: Term clustering of syntactic phrases. In: Proc. of the 13th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 385–404. ACM Press, Bruxelles, Belgium (1990)
22. Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proc. of the 22nd International Conference on Computational Linguistics. vol. 1, pp. 513–520. ACM Press, Manchester, UK (2008)
23. Pillay, S.R., Solorio, T.: Authorship attribution of web forum posts. In: Proc. of the eCrime Researchers Summit (eCrime 2010). pp. 1–7. IEEE, Dallas, TX, USA (2010)
24. Plakias, S., Stamatatos, E.: Author identification using a tensor space representation. In: Proc. of the 18th European Conference on Artificial Intelligence. vol. 178, pp. 833–834. IOS Press, Patras, Greece (2008)
25. Potthast, M., Stein, B., Barrón, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proc. of the 23rd International Conference on Computational Linguistics (COLING 2010). pp. 997–1005. ACL (August 2010)
26. Saffari, A., Guyon, I.: Quickstart guide for CLOP. Tech. rep., Graz University of Technology and Clopinet (May 2006)
27. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)
28. Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E.: Proc. of the 3rd International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN'09) (2009)
29. Tearle, M., Taylor, K., Demuth, H.: An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing 23(4), 425–442 (2008)
30. de Vel, O., Anderson, A., Corney, M., Mohay, G.: Multi-topic email authorship attribution forensics. In: Proc. of the ACM Conference on Computer Security - Workshop on Data Mining for Security Applications. Philadelphia, PA, USA (2001)