<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Bagging for Author Attribution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>François-Marie Giraud</string-name>
          <email>giraudf@poleia.lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Artières</string-name>
          <email>thierry.artieres@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIP6, Université Pierre et Marie Curie (UPMC)</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The authorship attribution literature shows how difficult it is to design classifiers that outperform simple strategies, such as linear classifiers operating on a number of the most frequent lexical features (e.g. character trigrams). We claim that this comes, at least partially, from the difficulty of efficiently learning the contribution of all features, which leads to either undertraining or overtraining of the classifiers. To overcome this difficulty we propose to use bagging techniques that learn classifiers on different random subsets of the features and then combine their decisions through a vote.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A key issue in author attribution and verification lies in feature definition and selection,
which motivated many studies [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. One conclusion is that despite many efforts
to build smart features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], very simple ones such as counts (or tf-idf-like features) of
words and/or of character n-grams are commonly used. Moreover, feature selection is
performed using simple criteria such as choosing the most frequent words and
character n-grams. Finally, simple classifiers such as linear SVMs have been shown to perform
well with the above features, and such simple systems appear to be difficult to outperform
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our work is an attempt to outperform such a simple and efficient strategy. It is
inspired by two key observations that have been made in the past.
      </p>
      <p>
        First, it has been observed that learning rich models from little training data may yield a
form of undertraining [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], where some relevant features are not fully taken into account
by the model after training. This may happen when a number of features (not necessarily
many) are sufficient, alone, for perfect discrimination of the training samples. In that
case learning may focus on finding good weights for a few of these relevant features
while neglecting the remaining relevant features. If only the neglected discriminative
features occur in a test sample, it will then be misclassified. This has been observed in
particular in text processing with log-linear models, where one usually exploits
a huge number of features and where training samples are often linearly separable with
a small subset of the features.
      </p>
      <p>
        Second, the work by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] suggests that an author’s writing style is characterized
by a limited number of discriminative features and, more importantly, by the way the
classifier performance behaves (i.e. how accuracy drops) when the most important features (e.g.
those having large weights after SVM learning) are iteratively removed.
      </p>
      <p>We investigate here new methods that take the two observations above into account to
design efficient classifiers for authorship attribution. They both rely on bagging ideas,
where one combines the results of a number of classifiers that are learned on training
samples represented with a random subset of features.</p>
      <p>We first survey related work in section 2, then provide in section 3
details on the datasets that we used in this paper in addition to the PAN 2012 challenge
datasets. Next we introduce our general idea and investigate the potential interest of
feature bagging in section 4, and we present our approach in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <sec id="sec-2-1">
        <title>Features</title>
        <p>Designing good features is a key issue for author identification, and many features have been
investigated so far. These may be grouped into a few categories: lexical features,
syntactic features, structural features, and contextual features. We briefly review them
below.</p>
        <p>Lexical features</p>
        <list list-type="bullet">
          <list-item>
            <p>TF-IDF (term frequency - inverse document frequency): TF-IDF features are standard
in text processing and information retrieval. They consist in counting word
occurrences and weighting these counts according to the words’ document frequency to decrease
the influence of frequent and uninformative (with respect to the topic of the text)
words [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Using TF-IDF amounts to representing a document in a very high-dimensional
space (there is one feature per word in the vocabulary). One may reduce the
dimension of the feature space by selecting features using various measures such as
information gain [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].</p>
          </list-item>
          <list-item>
            <p>Word length: Statistics on word length have been used in e.g. [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. It is a simple and
easy-to-compute feature, but it has low discriminative power.</p>
          </list-item>
          <list-item>
            <p>Sentence length.</p>
          </list-item>
          <list-item>
            <p>Richness of the vocabulary: This may be computed as the number of different words
used by an author. Again, it is a simple and easy-to-compute feature, but with low
discriminative power.</p>
          </list-item>
          <list-item>
            <p>Word N-grams: These features are counts of the number of occurrences of N
successive words. One usually considers only unigrams (N=1), bigrams (N=2) and trigrams
(N=3). One can use simple counts or TF-IDF-like scores [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Of course one cannot
consider all N-grams, which are much too numerous, and one has to select a priori
the most useful ones.</p>
          </list-item>
          <list-item>
            <p>Character N-grams: These features are similar to the previous ones, but they are
computed on tuples of characters rather than words [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Interestingly, these features have been
shown to be effective in a number of tasks on text data.</p>
          </list-item>
          <list-item>
            <p>CW: This is a short name for TF-IDF features computed for the 1 000 words with
the highest information gain (after [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]).</p>
          </list-item>
          <list-item>
            <p>CNG: This is a short name for TF-IDF features computed for the 1 000 trigrams
(of words) with the highest information gain (after [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]).</p>
          </list-item>
        </list>
        <p>Syntactic features</p>
        <list list-type="bullet">
          <list-item>
            <p>Linking words: Counting features (simple counts or TF-IDF-like normalized counts)
for particular words: conjunctions, prepositions, pronouns, modal verbs, ...</p>
          </list-item>
          <list-item>
            <p>Part Of Speech (POS): Counting features (simple counts or TF-IDF-like normalized
counts) on a tagged representation of the text: nouns, adjectives, verbs, singular,
plural, ... [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]</p>
          </list-item>
          <list-item>
            <p>POS N-grams: N-grams on POS tags [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]</p>
          </list-item>
        </list>
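        <p>As a rough illustration of the lexical features listed above, the following Python sketch counts character trigrams, keeps the most frequent ones over a small corpus, and computes TF-IDF-like weights. The function names, the toy corpus and the particular weighting tf × log(D / df) are assumptions made for this example; they are not necessarily the exact features used in the cited studies.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def most_frequent_ngrams(docs, top, n=3):
    """Select the `top` most frequent character n-grams over all documents."""
    totals = Counter()
    for doc in docs:
        totals.update(char_ngrams(doc, n))
    # Sort by decreasing count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def tfidf_vectors(docs, vocab, n=3):
    """Map each document to a vector of tf * log(D / df) weights over `vocab`."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    num_docs = len(docs)
    # Document frequency: number of documents containing each n-gram.
    df = {g: sum(1 for c in counts if c[g] > 0) for g in vocab}
    vectors = []
    for c in counts:
        vectors.append([
            c[g] * math.log(num_docs / df[g]) if df[g] else 0.0
            for g in vocab
        ])
    return vectors


# Toy corpus (hypothetical): three very short "documents".
docs = ["the cat sat", "the dog sat", "a bird sang"]
vocab = most_frequent_ngrams(docs, top=5)
vectors = tfidf_vectors(docs, vocab)
```
]]></preformat>
        <p>In this toy corpus the trigram " sa" occurs in every document, so its weight is zero everywhere, which illustrates how the inverse document frequency discounts ubiquitous features.</p>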
        <sec id="sec-2-1-1">
          <title>Structural features</title>
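          <list list-type="bullet">
            <list-item>
              <p>Font size [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ]</p>
            </list-item>
            <list-item>
              <p>Font color [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ]</p>
            </list-item>
            <list-item>
              <p>Number of images in the document [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ]</p>
            </list-item>
            <list-item>
              <p>Number of hyperlinks [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ]</p>
            </list-item>
          </list>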
        </sec>
        <sec id="sec-2-1-2">
          <title>Contextual features</title>
          <list list-type="bullet">
            <list-item>
              <p>Topic(s) of the document [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ]</p>
            </list-item>
            <list-item>
              <p>Elongation and inflexion of Arabic words [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ]</p>
            </list-item>
          </list>
          <p>The difficulty of finding good features for author identification lies in the fact that the
signature of an author is embedded in much other information in the text, concerning the topic, the
opinion, etc. Judging from the way we ourselves guess the author of a text,
we focus on very particular constructions, the use of particular words, etc.; in other words,
we look for any unusual deviation from an average way of writing. Also, it is likely
that the most discriminative features for one author are very dependent on that author
and cannot be guessed a priori.</p>
          <p>Consequently, one most often uses a possibly large number of features and lets the
classifier decide which ones are useful for the targeted author identification task.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>State of the art</title>
        <p>Few classification methods have been investigated to operate on documents
represented by a subset of features taken from the list above. Mainly two families of methods
have been used: methods coming from the information retrieval field (dot product or
any similarity measure on vector representations of documents), and methods from
the statistical machine learning field such as Support Vector Machines (SVM). Table 1
compares a few results from the literature in terms of corpus, classification method,
and features used to represent a document.</p>
        <p>We briefly comment on this table and on the studies from which these results
are taken.</p>
        <p>
          First of all, the work in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] on a French corpus focused on measuring the
relevance of using kernelized SVMs instead of simpler linear ones. Although it showed
improved accuracy, it is an isolated work. The literature shows on the contrary that
linear SVMs are popular and efficient in the author identification field. These models are
powerful enough when used with a large number of features, as has been demonstrated
in many text classification tasks: they often allow perfect classification on the training
set.
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] compared the efficiency of various feature sets with different classification
methods and showed that the best results were achieved with SVMs working with CW and CNG
features on a few datasets. Besides, the study of extremist posts (KKK for English and
Palestinian and Al-Aqsa Martyrs group for Arabic) concerned only 5 authors, but it
demonstrated the usefulness of structural and contextual features in this
particular context [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The conclusions of studies on the discriminative power of various features
vary, and it appears difficult to settle on a definitive set of discriminative features.
        </p>
        <p>In the end, few works aimed at designing new classification methods dedicated to
author identification. Linear SVMs appear to be a good compromise, and the key issue
is rather to determine which features are the most useful; this appears to be the main
question for obtaining good results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Datasets and experimental settings</title>
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>We report experimental results obtained on the PAN 2012 challenge datasets and on two
additional datasets on which we have been able to perform numerous experiments in
order to characterize the behaviour of our method. We provide here details on the two
additional datasets; details on the PAN 2012 challenge may be found on the challenge
website.</p>
        <p>
          The first additional dataset that we use is an English literature corpus used in some
previous publications ([
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]). It is very similar to the corpus used in the PAN 2012
challenge: there are 9 authors and 2 complete books per author. There are on average
100 thousand words per book, and every book was divided manually into about a hundred
documents, preserving the integrity of chapters and text sections. A large majority of the
documents are about 500 to 3000 words long.
        </p>
        <p>
          The second additional corpus is a subset of a corpus of blogs with about 18 000
authors [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We worked on a subset of this corpus considering only the 60 main authors
(bloggers), i.e. those who post frequently (at least 20 posts that are longer than
100 words).
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Experimental settings</title>
        <p>In all reported experiments we used linear support vector machines (SVMs) as
classifiers, since they have been shown to provide state-of-the-art results in many text
processing and classification tasks, and in particular for author authentication.</p>
        <p>
          We used the LibSVM library [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], where a multiclass classifier for an N-class
classification problem is implemented by learning N(N-1)/2 one-against-one
binary SVMs. All classifiers are learned with a standard L2 regularization term (to avoid
overfitting) whose weight is set on the validation dataset.
        </p>
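        <p>To give an idea of the number of binary classifiers this one-against-one scheme implies, the short Python sketch below enumerates the class pairs for the two additional datasets (9 and 60 authors). It is only an illustration of the N(N-1)/2 count; the function name is hypothetical and the code does not call LibSVM.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def one_vs_one_pairs(num_classes):
    """Enumerate the class pairs handled by one-against-one binary classifiers."""
    return list(combinations(range(num_classes), 2))


for n in (9, 60):
    pairs = one_vs_one_pairs(n)
    # The number of pairs equals n * (n - 1) / 2.
    print(n, len(pairs), n * (n - 1) // 2)
```
]]></preformat>
        <p>With 9 authors this gives 36 binary SVMs, and with 60 authors 1770, for each multiclass classifier.</p>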
        <p>Note also that we exploited the probabilistic variant of SVMs as implemented by
LibSVM. When the outputs of multiple SVMs are to be compared in order to make a
decision (see section 6), we naturally take the decision corresponding to the SVM with
the largest output.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Unexploited features and classifier undertraining</title>
      <p>
        We hypothesize that undertraining as described in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] often occurs in authorship
attribution tasks. Indeed, we observed on many datasets that SVMs working on many
lexical features (word or character trigram counts or tf-idf) easily reach 100%
accuracy on the training set, while performance on the test set may be significantly lower,
which is symptomatic of an overtraining problem. Yet, as suggested by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] it may
be more accurately understood as an undertraining problem when considering linear
models exploiting a large number of features.
      </p>
      <p>Indeed, when the classifier is rich enough (e.g. a linear classifier exploiting a
number of discriminative features) it may happen that some relevant features are not fully
taken into account by the learned model. This may happen when a number of features
(not necessarily many) are sufficient, alone, for perfect discrimination of the training
samples. Training may then focus on exploiting some of the relevant features allowing
perfect classification on the training set, while ignoring other relevant features. In
such a case, if a test sample contains only neglected discriminative features, it will be
misclassified. It is a form of undertraining that may occur when training samples are
linearly separable with a small subset of the features, e.g. when there are many features
and/or little training data.</p>
      <p>Figure 1 reports preliminary experimental results suggesting that there is
indeed some kind of undertraining in SVMs learned on authorship attribution tasks with
many simple (lexical) features. It plots the accuracy of a linear SVM
exploiting a limited number of features, X, ranging from 10 to 350 and chosen
from a set of 2 500 features (the 2 500 most frequent character trigrams), as a function of X.
Plots are given for the training set (top) and for the test set (bottom). Both plots provide
two curves corresponding to choosing the X features at random or selecting the X
most frequent features, a line corresponding to a linear SVM exploiting all 2 500
features, and an additional curve for the accuracy of an SVM using all 2 500 features
minus the X most frequent ones.</p>
      <p>These figures highlight some interesting facts. As may be seen, the
performance of classifiers exploiting only a few features is very high on the training set when
using the most frequent features, quickly reaching 100% accuracy (which
is also true when using all features), while it reaches a plateau on the test set at about
80% accuracy. There is thus a large gap between the performance on the training set
and on the test set, which indicates an overtraining problem. Also, the accuracy of an SVM
exploiting all 2 500 features minus the most frequent ones is very high both on the training
set and on the test set, which shows that these features contain discriminative information
too.</p>
      <p>The figures also show that using a few random features also allows discriminating
between authors to a certain extent, which means that all features (including less
frequent ones) contain some discriminative information. It is likely that the learning of
an SVM will focus on exploiting the most frequent features, so that in the end
SVMs will not necessarily fully exploit all discriminative features, since only
a few of them (the most frequent) already allow reaching 100% accuracy on the training set.
As a consequence there exist a number of discriminative features that are neglected
during learning and that might improve generalization.</p>
    </sec>
    <sec id="sec-5">
      <title>Bagging features for improved author authentication</title>
      <p>Based on the discussion above we aim at designing approaches able to fully exploit the
potential of all available features. We investigated methods relying on bagging features,
i.e. learning many classifiers on different subsets of the features then combining their
predictions.</p>
      <sec id="sec-5-1">
        <title>Principle</title>
        <p>
          Many methods have been proposed for combining classifiers, such as co-training,
boosting and bagging, a number of which have been designed or adapted for working with
classifiers exploiting different subsets of features [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In particular, feature bagging
has been investigated by a few researchers in the past [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Viola &amp; Jones [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
used boosting with extremely weak classifiers (each learned on a single feature) at every
iteration. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] also used boosting, with an adaptation of AdaBoost to feature weighting
instead of the sample weighting of standard AdaBoost.
        </p>
        <p>In this preliminary study we decided to investigate a standard bagging combination
in which a possibly large number of base classifiers are learned on random subsets
of the features (with possible overlap) and combined at test time through a
voting procedure. In practice we investigated a majority vote decision process with
a number of SVM classifiers trained on many (hundreds to thousands) random subsets
of few (tens to hundreds) features. The SVM classifiers are learned with the LibSVM toolbox (see
section 3).</p>
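        <p>The principle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: to keep the example self-contained, a nearest-centroid learner stands in for the linear SVM base classifiers, and all function names and the toy data are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter


def train_centroids(X, y, feats):
    """Fit a nearest-centroid classifier on the feature subset `feats`.

    This simple learner is only a stand-in for the linear SVM base classifiers.
    """
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for k, j in enumerate(feats):
            acc[k] += row[j]
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_centroids(centroids, row, feats):
    """Return the class whose centroid is closest on the subset `feats`."""
    def sq_dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(feats, centroids[c]))
    return min(sorted(centroids), key=sq_dist)


def train_feature_bagging(X, y, num_models, subset_size, seed=0):
    """Train `num_models` base classifiers, each on a random feature subset."""
    rng = random.Random(seed)
    num_features = len(X[0])
    ensemble = []
    for _ in range(num_models):
        # Subsets are drawn independently, so they may overlap across models.
        feats = rng.sample(range(num_features), subset_size)
        ensemble.append((feats, train_centroids(X, y, feats)))
    return ensemble


def predict_majority(ensemble, row):
    """Combine the base predictions with a majority vote."""
    votes = Counter(predict_centroids(c, row, f) for f, c in ensemble)
    # Break ties deterministically by preferring the smallest label.
    return min(votes, key=lambda label: (-votes[label], label))


# Toy data (hypothetical): 4 documents, 4 features, 2 authors.
X = [[5, 0, 4, 1], [4, 1, 5, 0], [0, 5, 1, 4], [1, 4, 0, 5]]
y = ["a", "a", "b", "b"]
ensemble = train_feature_bagging(X, y, num_models=25, subset_size=2)
prediction = predict_majority(ensemble, [5, 0, 5, 0])
```
]]></preformat>
        <p>In this toy example every feature is discriminative on its own, so each base classifier votes for author "a" whatever subset it was given; on real data the base classifiers disagree and the vote matters.</p>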
      </sec>
      <sec id="sec-5-2">
        <title>Experimental results</title>
        <p><bold>Preliminary experiments.</bold> Preliminary results were obtained on the 60-blogger
corpus.</p>
        <p>All the results presented in this section have been obtained on a single learning/validation/test
split to limit computational cost. All the models are trained on the learning set,
and the validation set is used to determine the optimal value of the SVM regularization
parameter.</p>
        <p>First we investigate the influence of the number of random features K exploited by
the base classifiers and of the number of base classifiers M. Figure 2 shows the evolution
of the system’s accuracy as a function of the number of base models. There are several
curves, each corresponding to a different number of random features used by a base classifier.
The features used are chosen from a set of the 3 000 most frequent character trigrams. As
may be seen, the value of K influences the performance of the overall approach, and it
seems better to use a small value of K, probably because it yields more variability between
the base classifiers. Besides, it appears that the more base classifiers there are, the higher
the accuracy, in particular when the base classifiers work on a small number
of random features.</p>
        <p>Table 2 provides some interesting statistics on the base classifiers: the mean
accuracy, the minimum accuracy (among all base classifiers) and the maximum accuracy
on the training, validation and test datasets. It shows in particular that the more
features the base classifiers exploit, the higher the average accuracy, but it also shows
that a single base classifier does not perform well alone.</p>
        <p>Table 3 compares the performance of the feature bagging approach with that of a
single SVM working with all features. One may see here that, whatever the number of
random features used by the base classifiers, the bagging approach systematically
outperforms an SVM exploiting all the features. This justifies a posteriori our discussion of the
undertraining phenomenon of classifiers in the context of author identification.</p>
        <p>These results show a significant improvement of the feature bagging approach
over a single SVM classifier exploiting all the features.</p>
        <p><bold>PAN’12 challenge.</bold> For the PAN challenge dataset, we learned models on a number of
splits of the provided training corpus into pairs of learning/validation datasets. For each
of these S learning/validation splits, we learned M models, each based on K features randomly selected
from the initial set of features (word or character trigram counts) (see Table 4).
The final decision is taken over the S × M models, each learned with K features randomly selected
among the T initial features. For closed problems, we simply used a majority vote
to predict on the test data. For open problems we used the same models and voting
method but fixed a threshold for each author, based on validation results, below which we
consider that none of the candidates is the real author. The table below summarizes our submitted methods by
task. In the table the initial set of features is described by the
feature type (i.e. word count or character n-gram count) and by the number T of most
frequent features kept from the training set. The accuracy of our approach for a variety
of hyperparameter values (M, K, etc.) is given in the table.</p>
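        <p>The decision rule for open problems can be sketched as follows in Python. The text does not specify which quantity the per-author threshold is applied to; this sketch assumes it is the share of votes obtained by the winning author, and the function name and threshold values are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def vote_with_rejection(votes, thresholds):
    """Majority vote that answers None when the winner's vote share is too low.

    votes is the list of authors predicted by the S * M models; thresholds[a]
    is a per-author minimum vote share (assumed here to be tuned on
    validation data).
    """
    tally = Counter(votes)
    # Most voted author; ties are broken by preferring the smallest label.
    winner = min(tally, key=lambda a: (-tally[a], a))
    share = tally[winner] / len(votes)
    return winner if share >= thresholds[winner] else None


votes = ["A"] * 6 + ["B"] * 4
closed_like = vote_with_rejection(votes, {"A": 0.5, "B": 0.5})
rejected = vote_with_rejection(votes, {"A": 0.7, "B": 0.5})
```
]]></preformat>
        <p>With a threshold of 0.5 the winner "A" (60% of the votes) is accepted, whereas with a threshold of 0.7 for "A" the document is attributed to none of the candidates.</p>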
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Two stage approach</title>
      <p>
        To go further we built on the work of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], who suggested that an author’s writing style
is characterized by a limited number of discriminative features and, more importantly,
by the way the classifier performance behaves (i.e. drops) when the most important features
(those having large weights after learning) are iteratively removed. We designed a method that
is inspired by this work and exploits weak SVM classifiers learned on random subsets
of features.
We now explain the principle of this approach. Consider an author classification
problem with N authors (classes) and where documents are represented by p-dimensional
feature vectors. The system we propose is a two-stage system.
      </p>
      <p>In the first stage we learn K linear multiclass SVMs, each exploiting a random subset (of
size X) of the p original features, as in section 5. We denote by S<sub>i</sub> ⊂ [1, p] the set of
indices of the features used by the ith SVM, and by SVM<sub>i</sub> the ith classifier. These classifiers
are learned to assign a document to one of the N authors.</p>
      <p>Then we use the classifiers of the first stage to build new vectors (that we call
profiles) that will be processed in a second stage by another classifier. We start by
describing how the first stage is used to build a new training dataset, which we call the
second stage dataset. For any author a ∈ [1, N] and for any document d, we build a
new p-dimensional feature vector (a profile) whose jth component is the proportion of
classifiers (among the first stage classifiers that exploit feature j) that predict author a. More
formally, the profile for a particular (document, author) pair, which we denote u(d, a), is a
vector whose components are defined as:</p>
      <disp-formula>
        <label>(1)</label>
        <tex-math>u_j(d, a) = \frac{1}{Z(j)} \sum_{i=1}^{K} \delta(j \in S_i) \, \delta(\mathrm{SVM}_i(d) = a)</tex-math>
      </disp-formula>
      <p>where SVM<sub>i</sub>(d) stands for the output (a class number in [1..N]) of the ith SVM for
document d, where δ(P) equals one if predicate P is true and 0 otherwise, and where
Z(j) is the normalization factor Z(j) = Σ<sub>i=1..K</sub> δ(j ∈ S<sub>i</sub>). In the end, u<sub>j</sub>(d, a) is
the proportion of classifiers, among those that exploit the jth feature, that predict
author a.</p>
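      <p>The construction of a profile can be made concrete with a small worked example in Python. It follows the definition of the components u_j(d, a) given above; the function name, the toy subsets and the convention of a zero component for a feature used by no classifier are assumptions of this sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def build_profile(subsets, predictions, author, num_features):
    """Compute the profile u(d, a) of equation (1) for one document.

    subsets[i] is the set S_i of feature indices used by the i-th classifier,
    predictions[i] is the author predicted by that classifier for document d.
    """
    profile = []
    for j in range(num_features):
        # Classifiers that exploit feature j; their number is Z(j).
        using_j = [i for i, s in enumerate(subsets) if j in s]
        if not using_j:
            # Assumption: a feature used by no classifier gets a zero component.
            profile.append(0.0)
            continue
        agree = sum(1 for i in using_j if predictions[i] == author)
        profile.append(agree / len(using_j))
    return profile


# Toy first stage (hypothetical): 4 classifiers over 5 features (indexed from 0).
subsets = [{0, 1}, {1, 2}, {0, 2}, {1, 3}]
predictions = [1, 1, 2, 1]
profile_author_1 = build_profile(subsets, predictions, author=1, num_features=5)
profile_author_2 = build_profile(subsets, predictions, author=2, num_features=5)
```
]]></preformat>
      <p>Here feature 1 is used by three classifiers that all predict author 1, so the corresponding component of the profile of author 1 is 1.0, while feature 0 is used by two classifiers that disagree, giving 0.5 in both profiles.</p>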
      <p>
        We then use such profiles as input to a second stage classification system. Those
profiles may be sorted (from the highest value to the lowest) so that the numbering of
the components is lost. Figure 3 shows such profiles for the 60 authors of the blog
dataset. As suggested by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the rate at which the performance decreases may be characteristic of a
particular author.
      </p>
      <p>At this point, each document in the training and validation corpora corresponds to
N profiles, one for every author. Based on this new dataset we learn a new classifier
to discriminate between positive examples (profiles that are built for the actual author)
and negative examples. The classifier is a prototype-based method where one author (one
class) is represented by the mean profile vector of this class computed on the training set.
At test time one computes a similarity between a test profile and the reference profile
for every class. We investigated Euclidean distance and correlation similarity. Table</p>
      <p>When classifying a test document we first run the set of K SVM classifiers of the
first stage, then we build N p-dimensional profiles as above. We take the final decision
based on the second stage classifier, which is run on these N second stage feature vectors
(looking for the highest similarity or the lowest distance).</p>
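      <p>The second stage can be sketched as follows in Python, using the Euclidean distance. This is one possible reading of the description above, under stated assumptions: the prototype of an author is taken to be the mean of the sorted profiles built for that author on his or her own training documents, and a test document is attributed to the author whose test profile is closest to that author's prototype. Function names and the toy values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def sorted_profile(profile):
    """Sort a profile from its highest to its lowest component."""
    return sorted(profile, reverse=True)


def mean_profile(profiles):
    """Component-wise mean of a list of equal-length profiles."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def euclidean(u, v):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(test_profiles, prototypes):
    """Pick the author whose test profile is closest to that author's prototype.

    test_profiles[a] is the profile u(d, a) of the test document for author a,
    prototypes[a] is the mean (sorted) training profile of author a.
    """
    return min(
        sorted(prototypes),
        key=lambda a: euclidean(sorted_profile(test_profiles[a]), prototypes[a]),
    )


# Toy positive training profiles (hypothetical), two documents per author.
train_profiles = {
    "a": [[0.9, 0.7, 0.2], [0.7, 0.9, 0.4]],
    "b": [[0.6, 0.5, 0.5], [0.4, 0.5, 0.6]],
}
prototypes = {
    a: mean_profile([sorted_profile(p) for p in ps])
    for a, ps in train_profiles.items()
}
test_profiles = {"a": [0.8, 0.2, 0.8], "b": [0.1, 0.2, 0.1]}
predicted = classify(test_profiles, prototypes)
```
]]></preformat>
      <p>Sorting the profiles before comparison discards the identity of the features, so only the shape of the profile is compared; replacing the distance by a correlation measure would only change the function passed as key.</p>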
      <p>Table 5 shows that this second approach also significantly outperforms the single
SVM approach and reaches results similar to those of the standard bagging approach
while working on a very different representation of the documents. This gives some hope
that the two methods could be advantageously combined, which we did not investigate
for lack of time.</p>
      <p>We presented an experimental investigation showing that one of the most competitive
methods for author identification may suffer from undertraining. We built on this idea to
propose new approaches for author identification that rely on bagging
features. The first method is a rather traditional method for bagging features and achieved
interesting results on the PAN 2012 challenge, reaching third place among eleven
participants on closed identification tasks. The second method extends the feature bagging
strategy and provides promising preliminary results.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>Special thanks to Moshe Koppel of Bar-Ilan University in Israel for providing us with the
corpus. This work has been done in the context of the SAIMSI project (reference
ANR09-CSOSG-SAIMSI) funded by the French Research Agency (ANR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abbasi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Applying authorship analysis to extremist-group web forum messages</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <fpage>67</fpage>
          -
          <lpage>75</lpage>
          (
          <year>Sep 2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Argamon-Engelson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Avneri</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Style-based text categorization: What newspaper am I reading?</article-title>
          <source>In: Proceedings of the AAAI Workshop on Text Categorization</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          ,
          <fpage>27:1</fpage>
          -
          <lpage>27:27</lpage>
          (
          <year>2011</year>
          ), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fucks</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>On mathematical analysis of style</article-title>
          .
          <source>Biometrika</source>
          <volume>39</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>122</fpage>
          -
          <lpage>129</lpage>
          (
          <year>1952</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hoover</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Frequent Word Sequences and Statistical Stylistics</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>17</volume>
          ,
          <fpage>157</fpage>
          -
          <lpage>180</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
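          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.D.</given-names>
          </string-name>
          :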
          <article-title>Authorship classification: a discriminative syntactic tree mining approach</article-title>
          . In: SIGIR (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kjell</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Authorship determination using letter pair frequency features with neural network classifiers</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>9</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>124</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computational methods in authorship attribution</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Messeri</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution with thousands of candidate authors</article-title>
          .
          <source>In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>659</fpage>
          -
          <lpage>660</lpage>
          . SIGIR '06,
          <publisher-name>ACM</publisher-name>
          , New York, NY, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonchek-Dokow</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Measuring differentiability: unmasking pseudonymous authors</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Martindale</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McKenzie</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On the utility of content analysis in author attribution: <italic>The Federalist</italic></article-title>
          .
          <source>Computers and the Humanities</source>
          <volume>29</volume>
          ,
          <fpage>259</fpage>
          -
          <lpage>270</lpage>
          (
          <year>1995</year>
          ),
          <pub-id pub-id-type="doi">10.1007/BF01830395</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mealand</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Correspondence analysis of Luke</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          ),
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>O'Sullivan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Featureboost: A meta learning algorithm that improves model robustness</article-title>
          .
          <source>In: Proceedings of the Seventeenth International Conference on Machine Learning</source>
          . pp.
          <fpage>703</fpage>
          -
          <lpage>710</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sindelar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature bagging: Preventing weight undertraining in structured discriminative learning</article-title>
          .
          <source>Tech. rep.</source>
          ,
          <source>CIIR</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Teytaud</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jalam</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Kernel-based text-categorization</article-title>
          .
          <source>In: International Joint Conference on Neural Networks (IJCNN'2001)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>0</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Rapid object detection using a boosted cascade of simple features</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>