<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sub-Profiling by Linguistic Dimensions to Solve the Authorship Attribution Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Alabama at Birmingham Department of Computer and Information Sciences</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe a modified version of the profile-based approach for the Authorship Attribution (AA) task of the PAN 2012 challenge. Our PAN system for AA applies the concept of linguistic modalities to profile-based (PB) approaches. We concatenate all the training documents from the same author and build author-specific sub-profiles, one per linguistic modality. Then, instead of using all the different types of features to compute the similarity of a test document against an author’s profile in a single step, we compute several similarity scores using one set of features (modality) at a time. Each modality assigns the test document to the author whose sub-profile has the highest cosine similarity in that modality. Final classification decisions are based on the combination of decisions from each modality using majority voting. We achieved competitive results on PAN 2012, with encouraging results on closed-class authorship attribution.</p>
      </abstract>
      <kwd-group>
        <kwd>authorship attribution</kwd>
        <kwd>profile-based approach</kwd>
        <kwd>linguistic modality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Authorship attribution is the task of identifying the author of an unseen document. AA
has a long history with multiple application areas that include spam filtering [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], cyber
bullying, plagiarism detection [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], author recognition of a given program [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and web
information management. There are two flavors of AA tasks: closed-class and
open-class. Most research has focused on the closed-class authorship identification
task. In the closed-class AA task, an anonymous test document must be assigned to
one of a known group of candidate authors. In contrast, in the open-class problem, an
anonymous test document might belong to an author not included in the list of known
candidates. The closed-class problem is considered easier than the open-class
problem because it assumes that the unknown instance is drawn from the
classes present in the training set. We propose a framework for both the closed-class
and the open-class authorship attribution tasks of the PAN 2012 competition. The
PAN 2012 competition dataset contains different types of documents of different
lengths, including novel-length ones. The authorship task this year is more
challenging than in previous years because each sub-task has a very limited number of training
instances per author, and the learning system must still model each
author’s writing style effectively.
      </p>
      <p>
        In recent work, Solorio et al. (2011) showed that representing documents as a
set of separate linguistic modalities, where each modality refers to a type of feature,
in a standard machine learning approach yields
good results in AA. On the other hand, profile-based (PB) approaches have consistently
shown promising results for the same task [
        <xref ref-type="bibr" rid="ref10 ref14 ref3 ref8">3,8,10,14</xref>
        ]. Our PAN system for AA
combines these two ideas. We use the notion of linguistic modalities to generate sub-profiles
of the authors, one per modality, and we use these sub-profiles to make predictions on
authorship based on similarity scores. This independent processing of modalities
follows the motivation in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which argues that processing
features independently by modality makes it possible to extract more meaningful similarity scores. However, in
that previous work each document is represented as a unique instance of the problem,
and the similarity scores are used as features to train a machine learning classifier. Here,
we concatenate all the documents from a single author to generate the sub-profiles. Each
sub-profile then represents the author’s writing style along a specific dimension, and
each dimension votes for a candidate author.
      </p>
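      <p>The construction of sub-profiles described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the two extractors (character trigrams and word unigrams) and the names <monospace>MODALITIES</monospace> and <monospace>build_sub_profiles</monospace> are assumptions chosen for brevity, whereas the system itself uses five richer modalities.</p>
      <preformat>
```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping character trigrams in a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def word_unigrams(text):
    """Count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


# Hypothetical modality extractors, used only for illustration.
MODALITIES = {"char_3grams": char_trigrams, "word_1grams": word_unigrams}


def build_sub_profiles(docs_by_author):
    """Concatenate each author's documents and extract one profile per modality."""
    profiles = {}
    for author, docs in docs_by_author.items():
        combined = "\n".join(docs)
        profiles[author] = {
            name: extract(combined) for name, extract in MODALITIES.items()
        }
    return profiles


training = {"A": ["the cat sat", "the cat ran"], "B": ["dogs bark loudly"]}
profiles = build_sub_profiles(training)
```
      </preformat>
      <p>Each author ends up with one feature-count mapping per modality, so adding a modality only requires registering another extractor.</p>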
      <p>We participated in both the closed-class and the open-class problems and submitted
one run per sub-task. The parameters of the final authorship attribution system, for
both the closed-class and the open-class problems, were adjusted using only the training data. Our
system yielded very competitive results in the competition, reaching the best accuracies
for several tasks. In what follows, we discuss in more detail the process we followed
during the parameter tuning of our system, as well as the evaluation results.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Authorship attribution problems have been solved mainly using either machine learning
or profile-based approaches [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Machine learning approaches, where each document
in the training set is an instance of the problem, are based on a traditional text
classification framework. Here, a machine learning algorithm, such as Support Vector Machines,
will be trained using a set of feature vectors, each feature vector representing a
sample document [
        <xref ref-type="bibr" rid="ref2 ref6">6,2</xref>
        ]. Profile-based approaches, on the other hand, solve the authorship
attribution task by creating a profile of each author. For each author, all the training
instances are concatenated into a single file and appropriate features are extracted to
create author profiles. Our system borrows the idea of profiles because it has been
successfully used for the authorship attribution task [
        <xref ref-type="bibr" rid="ref11 ref7">7,11</xref>
        ] and because we consider that
having a very small training set, as is the case in this year’s competition, will represent
a challenge for traditional machine learning approaches. However, as mentioned earlier
we generate a set of sub-profiles per author, instead of a single one.
      </p>
      <p>
        Profile-based approaches have already been successfully used for attributing an
unknown text to its real author. Successful examples of this approach include: [
        <xref ref-type="bibr" rid="ref11 ref15 ref4 ref7 ref9">4,7,9,11,15</xref>
        ].
Frantzeskou et al. (2007) demonstrated a profile-based approach on several
datasets. They described how an effective and robust classifier can be built with
a modified similarity measure. They proposed the Source Code Author
Profile (SCAP) method, which uses byte-level n-grams to build the profiles, following
the method of Keselj et al. (2003). The SCAP study carries broad implications for
researchers in authorship attribution, as it highlights the use of language-independent
features and a simple similarity measure.
      </p>
      <p>
        A lot of emphasis has also been placed on the use of machine learning for AA.
Successful examples include a wide variety of learning algorithms, such as Support
Vector Machines [
        <xref ref-type="bibr" rid="ref17 ref2">2,17</xref>
        ], decision trees [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and memory based learners [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The work
most closely related to our system is that of Solorio et al. (2011). They proposed a
system that uses features from different linguistic dimensions (syntactic, lexical,
stylistic, perplexity values), where each dimension is treated independently of the others.
These dimensions are used to generate informative meta-features, on the assumption
that meaningful similarity patterns among authors emerge more clearly if the
similarities are computed separately for each modality. After the similarities for all modalities are
extracted, they are merged and used as features in a standard machine learning setting.
As mentioned earlier, we borrow the idea of modalities in our PAN system. The details
are described in the following section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Predicting Authorship Using Sub-profiles</title>
      <p>
        The main difference between our approach and the standard PB approaches is that we
use a set of sub-profiles to make the predictions, instead of computing a single one. The
different sub-profiles correspond to the different linguistic dimensions in the writeprint
of authors. We followed the notion of modalities proposed in previous work [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
although the exact features in each modality are not the same as in that work.
Here we use stylistic, syntactic, semantic, character n-gram, and word n-gram features,
resulting in five different modalities. The stylistic features include the number of
punctuation marks, sentences, tokens, and contractions, among other
features. The syntactic features combine all the part-of-speech (POS) tag
unigrams with the top n bigrams, trigrams, and grammatical relations from dependency
parses. In the semantic modality, we use the top n words after removing stop words.
With the increased popularity and effectiveness of n-grams for authorship attribution
tasks [
        <xref ref-type="bibr" rid="ref3 ref8">8,3</xref>
        ], we treat character-level and word-level n-grams as two
different modalities. The character n-gram modality contains the top n trigrams. The last modality, word
n-grams, contains the top 1500 n-grams for each n = 1, 2, 3. Table
1 lists all the features used in the different modalities. After the sub-profiles have
been extracted, we compute the similarity scores between the sub-profile of the test document and those of the candidate
authors. Our approach uses cosine similarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as the similarity metric
since it has been used successfully in previous work. Note that this decision imposes
a restriction on the set of features used in each profile. Typical PB approaches select the
set of features in the profile on a per-author basis. Our system uses a fixed set of
features for all authors, and this set is determined by the training set. For modality <italic>m</italic>, the
cosine similarity between two sub-profiles (the one for author <italic>a</italic> and the one for test
document <italic>t</italic>), represented by <italic>P<sub>a</sub></italic> and <italic>P<sub>t</sub></italic>, each having <italic>l<sub>m</sub></italic> elements, is computed using
Equation 1.
      </p>
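      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{sim}_m(P_a, P_t) = \frac{P_a \cdot P_t}{|P_a|\,|P_t|} = \frac{\sum_{i=1}^{l_m} P_{ai} P_{ti}}{\sqrt{\sum_{i=1}^{l_m} P_{ai}^{2}}\,\sqrt{\sum_{i=1}^{l_m} P_{ti}^{2}}}</tex-math>
        </disp-formula>
      </p>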
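      <p>As a concrete reading of Equation 1, the sketch below computes the cosine similarity of two sub-profiles after projecting both onto the same ordered feature list, which reflects the fixed feature set mentioned above. The helper names and the zero-norm convention are assumptions made for this illustration.</p>
      <preformat>
```python
import math


def to_vector(counts, features):
    """Project a feature-count mapping onto a fixed, ordered feature list."""
    return [float(counts.get(f, 0)) for f in features]


def cosine_similarity(p_a, p_t):
    """Cosine similarity of two equal-length numeric vectors (Equation 1)."""
    if len(p_a) != len(p_t):
        raise ValueError("sub-profiles must share the same feature set")
    dot = sum(x * y for x, y in zip(p_a, p_t))
    norm_a = math.sqrt(sum(x * x for x in p_a))
    norm_t = math.sqrt(sum(y * y for y in p_t))
    if norm_a == 0.0 or norm_t == 0.0:
        return 0.0  # assumption: an empty profile matches nothing
    return dot / (norm_a * norm_t)


features = ["the", "of", "and"]
author_profile = {"the": 3, "of": 4}
test_profile = {"the": 6, "of": 8, "unseen": 5}  # "unseen" is outside the feature set
sim = cosine_similarity(
    to_vector(author_profile, features), to_vector(test_profile, features)
)
```
      </preformat>
      <p>Because cosine similarity ignores vector length, a short test document can still match a profile built from much longer concatenated training text.</p>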
      <p>
        We allow each modality to make authorship predictions based on the cosine
similarity of the authors’ sub-profiles in that modality. Each modality assigns the test
document to the author whose sub-profile has the highest cosine similarity in that
modality. Final classification decisions combine the decisions from each
modality using an appropriate voting mechanism [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Depending upon the type of AA
problem, the weight of each modality in the final voting can be adjusted to improve the
performance of the AA system.
      </p>
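      <p>The per-modality decision described above amounts to an argmax over candidate authors. The sketch below assumes the similarity scores have already been computed; the score values and modality names are invented for illustration only.</p>
      <preformat>
```python
def predict_author(similarities):
    """Return the author with the highest similarity score in one modality."""
    return max(similarities, key=similarities.get)


# Hypothetical cosine similarities of one test document, per modality.
scores_by_modality = {
    "stylistic": {"A": 0.91, "B": 0.88, "C": 0.40},
    "syntactic": {"A": 0.75, "B": 0.80, "C": 0.60},
    "char_ngrams": {"A": 0.66, "B": 0.52, "C": 0.49},
}

# One vote per modality, to be combined by the voting mechanism.
votes = {m: predict_author(s) for m, s in scores_by_modality.items()}
```
      </preformat>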
      <p>Due to the limited number of training instances per author, we use
majority voting to combine the decisions of the modalities in the ensemble. Let <italic>x</italic> be a test
instance. The final prediction <italic>p<sub>t</sub></italic> for test instance <italic>x</italic> is determined as:</p>
      <p>
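        <disp-formula>
          <label>(2)</label>
          <tex-math>p_t = \arg\max_{y \in Y} \sum_{m=1}^{k} \delta\big(h_m(x), y\big)</tex-math>
        </disp-formula>
where <italic>h<sub>m</sub></italic>(<italic>x</italic>) is the prediction of modality <italic>m</italic>, <italic>k</italic> is the number of modalities, δ(<italic>i</italic>, <italic>j</italic>) is
an indicator function whose value is 1 iff <italic>i</italic> = <italic>j</italic> and 0 otherwise, and <italic>Y</italic> = {<italic>y</italic><sub>1</sub>, …, <italic>y<sub>l</sub></italic>}
is the set of possible classes (authors). To increase the prediction performance as well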
as decrease the computational overhead, we perform feature selection based
on information gain. To determine the percentage of features to select from each
modality, we used 50% of the training data as a validation set and the remaining 50% for
training. We experimented with different percentages of the features for each modality:
20%, 40%, 60%, and 80%. Based on the performance of the system on the
validation set, we determined the number of features to use for the attribution of test
instances. After analyzing the performance on the validation set, we decided to use 80%
of the stylistic features, 60% of the syntactic features, 20% of the semantic features,
20% of the character n-grams, and 40% of the word n-grams. To perform the feature selection
based on information gain, we converted the feature vector file to ARFF format and used
WEKA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. These parameters are fixed for all the datasets so that we can
analyze how robust and consistent our AA system is across different datasets. This framework
for the closed-class AA system is evaluated on three datasets released by PAN 2012.
      </p>
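      <p>To make the feature selection step concrete, the sketch below ranks features by information gain and keeps a given fraction of them. It is an illustration only: the system relies on WEKA for this step, while this sketch assumes discrete feature values (for example presence or absence) and uses hypothetical function names.</p>
      <preformat>
```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_top_fraction(columns, labels, fraction):
    """Keep the given fraction of features, ranked by information gain."""
    ranked = sorted(
        columns, key=lambda name: information_gain(columns[name], labels), reverse=True
    )
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


labels = ["A", "A", "B", "B"]
columns = {"noisy": [1, 0, 1, 0], "informative": [1, 1, 0, 0]}
selected = select_top_fraction(columns, labels, 0.5)
```
      </preformat>
      <p>With the percentages reported above, <monospace>fraction</monospace> would be 0.8 for the stylistic modality and 0.2 for the semantic one.</p>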
      <p>We also participated in the open-class authorship attribution task of PAN 2012. The
open-class task is much harder than the closed-class task because test documents may belong to
an unknown class. We did not perform feature selection for this task. However, for the open-class
problem we added an extra modality containing perplexity values from 4-gram
language models at the character level. The extra modality was added after evaluating
the performance of the open-class AA system on the validation set: because
adding it improved performance, we decided
to include it for the open-class problem.</p>
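      <p>The perplexity modality can be illustrated with a small character-level 4-gram language model. The text does not specify the smoothing method or how the perplexity values enter the sub-profile, so the add-one smoothing, the space padding, and the function names below are all assumptions of this sketch.</p>
      <preformat>
```python
import math
from collections import Counter


def train_char_lm(text, n=4):
    """Collect n-gram and context counts from one author's concatenated text."""
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return ngrams, contexts, set(text)


def perplexity(model, text, n=4):
    """Perplexity of a non-empty text under an add-one smoothed n-gram model."""
    ngrams, contexts, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot for unseen characters
    padded = " " * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        prob = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(text))


model = train_char_lm("ab" * 20)
familiar = perplexity(model, "abab")
unfamiliar = perplexity(model, "zzzz")
```
      </preformat>
      <p>A lower perplexity means the author's model finds the test text less surprising, so <monospace>familiar</monospace> is smaller than <monospace>unfamiliar</monospace> in this example.</p>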
      <p>To deal with the “out of training set” cases, we use a threshold on the
difference between the cosine similarities of the first and second predictions of each modality.
More specifically, to determine whether a test document belongs to an unknown author, we
compare the highest and the second-highest cosine similarity scores of the author
sub-profiles for each modality. If the difference between them is smaller than the threshold
value, we decide that the test instance belongs to none of the authors in that modality.
We consider the threshold a filter for confusing cases, and we assign those
confusing cases to the “unknown” category. The idea behind this approach is that
indecision, as indicated by close similarity scores for different authors,
arises from trying to force an assignment of authorship to a document
that has no clear match inside the training set. In that case we should not assign the document to
any of our candidate authors, and the system should predict “unknown”.</p>
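      <p>The rejection rule above can be written in a few lines. This sketch covers a single modality; the function name and the handling of a single-candidate case are assumptions, and the scores are made up for illustration.</p>
      <preformat>
```python
def predict_open_class(similarities, threshold):
    """Return the best author, or "unknown" when the top two scores are too close."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]  # assumption: nothing to compare against
    (best, top_score), (_, second_score) = ranked[0], ranked[1]
    if threshold > top_score - second_score:
        return "unknown"
    return best


close_call = predict_open_class({"A": 0.90, "B": 0.85, "C": 0.30}, threshold=0.08)
clear_win = predict_open_class({"A": 0.90, "B": 0.70, "C": 0.30}, threshold=0.08)
```
      </preformat>
      <p>Here <monospace>close_call</monospace> is rejected because the margin of 0.05 falls below the threshold, while <monospace>clear_win</monospace> is attributed to author A.</p>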
    </sec>
    <sec id="sec-4">
      <title>Experimental Results and Evaluation</title>
      <p>
        All the parameters in our system were set through experiments using
only the training data. We used 50% of the training instances as validation data and the
remaining 50% as training data to set the parameters. For the n-gram modalities, we
selected the top 3000 features, except for word n-grams, where performance on
the validation set was better with the top 1500 n-grams for each
n = 1, 2, 3. We used the Stanford parser [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to generate the parse trees. For both the
closed-class and the open-class problem, we performed experiments separately for each
modality on the provided datasets.
      </p>
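      <p>The selection of the most frequent word n-grams for each order can be sketched as follows. Whitespace tokenization and lowercasing are simplifying assumptions, and the function names are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def top_word_ngrams(text, n, k):
    """Return the k most frequent word n-grams in a text."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [gram for gram, _ in Counter(grams).most_common(k)]


def word_ngram_features(text, k=1500):
    """Collect the top-k unigrams, bigrams, and trigrams into one feature list."""
    features = []
    for n in (1, 2, 3):
        features.extend(top_word_ngrams(text, n, k))
    return features


sample = "to be or not to be"
top_bigram = top_word_ngrams(sample, 2, 1)
features = word_ngram_features(sample, k=2)
```
      </preformat>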
      <p>There are a total of six datasets, three from the closed-class problem and three from
the open-class problem. For each dataset in the closed-class problem, there is a
corresponding dataset in the open-class problem. Therefore, the training data is the same
for both the closed-class and open-class problems while test data is different. There
are three kinds of datasets for each problem: one with three authors, another with eight
authors, and the last with 14 authors. As described in Section 3, we first create the
author-specific sub-profiles, one per linguistic modality. Then the prediction of the test
instances is carried out separately on a modality basis. This means each modality will
have its own prediction on the same test instance. The predictions are combined using
a majority vote.</p>
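      <p>A minimal sketch of the majority vote follows. The tie-breaking behaviour (the label seen first wins) is an assumption of this example, since no tie-breaking rule is stated here.</p>
      <preformat>
```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most modalities.

    Ties go to the label that appears first in `predictions`.
    """
    counts = Counter(predictions)
    return max(counts, key=counts.get)


# One hypothetical prediction per modality for the same test instance.
final = majority_vote(["A", "B", "A", "C", "A"])
```
      </preformat>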
      <p>[Table header, body not preserved. Modality Type: Stylistic, Syntactic, Semantic, Character n-gram, Word n-gram, Majority Voting]</p>
      <p>Table 2 presents the performance of the individual modalities, as well as the
combination of the predictions from each modality, on the three datasets provided in
the PAN 2012 authorship attribution competition. Our system achieves an overall accuracy
of 100% on two of the three datasets. The accuracy on the remaining dataset (14
authors) is also very high. As can be seen, different modalities have different prediction
accuracies on different datasets. For 3 Authors, the syntactic and character n-gram modalities
obtain 100% accuracy, while all other modalities obtain 83.33%. Similarly, for 8
Authors, each modality performs differently. As expected, when we combine the individual
predictions of the sub-profiles in an ensemble, the accuracy of the final combination is
higher than or equal to that of the best individual classifier. The final accuracy for 8
Authors is 100%, which is higher than the accuracy of each individual classifier. After
looking at the accuracies of other PAN participants on each dataset, we observed that no
participant obtained an accuracy higher than ours. This indicates that our AA system for the
closed-class problem worked consistently well across different datasets and was among
the top closed-class AA systems in the competition.</p>
      <p>In the open-class problem, the challenging part is determining the threshold value.
We performed experiments on the different datasets using 0.02, 0.04, 0.06, and 0.08
as the threshold value and measured accuracy on the validation set. Based on the
performance of the AA system on the validation set for these values, we
set the threshold to 0.08. We also tried increasing the threshold beyond 0.08, but
this yielded very poor performance on the validation set. As a reminder, we compare
the highest and the second-highest cosine similarity scores of the author sub-profiles
for each modality. The difference between them is compared against the threshold to
determine whether an instance belongs to the “unknown” category.</p>
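      <p>The threshold search described above can be sketched as a small grid search over candidate values. The validation examples below are invented for illustration and cover a single modality; only the candidate values 0.02 to 0.08 come from the text.</p>
      <preformat>
```python
def open_class_predict(similarities, threshold):
    """Best-scoring author, or "unknown" if the top-two margin is below the threshold."""
    ranked = sorted(similarities.values(), reverse=True)
    if len(ranked) > 1 and threshold > ranked[0] - ranked[1]:
        return "unknown"
    return max(similarities, key=similarities.get)


def accuracy(validation, threshold):
    """Fraction of validation items whose prediction matches the gold label."""
    correct = sum(
        1 for sims, gold in validation if open_class_predict(sims, threshold) == gold
    )
    return correct / len(validation)


def select_threshold(validation, candidates=(0.02, 0.04, 0.06, 0.08)):
    """Pick the candidate threshold with the highest validation accuracy."""
    return max(candidates, key=lambda t: accuracy(validation, t))


# Hypothetical validation items: (similarity scores, gold label).
validation = [
    ({"A": 0.90, "B": 0.85}, "unknown"),
    ({"A": 0.95, "B": 0.60}, "A"),
    ({"A": 0.70, "B": 0.77}, "unknown"),
]
best_threshold = select_threshold(validation)
```
      </preformat>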
      <p>Table 3 presents the accuracy obtained using the individual linguistic modalities, as well
as the combination of the predictions from each modality, for the open-class problem
of the PAN 2012 competition. Even though our results for the third dataset, with 14
authors, were not good, our system yielded promising results on the other two datasets.
After combining the predictions from the individual modalities, we observed an accuracy of 60%
for the 3 Authors dataset. This accuracy can be considered competitive when
compared with that of other competitors on the same dataset. Our open-class AA system
performed very well on the 8 Authors dataset, reaching an accuracy of 76.47%, which
is the highest accuracy obtained on that dataset. After analyzing the performance on
the open-class datasets, we see that in only one case (row 2 of Table 3) is the combined
accuracy higher than that of the individual classifiers. For the other two datasets,
the combination of predictions from the sub-profiles does not give the best results.
This could be due to having some very weak predictors in the ensemble.</p>
      <p>For both the closed-class and open-class problems, sub-profiling by linguistic modality
achieved very good overall performance in the PAN 2012 authorship attribution
challenge. However, further experiments with additional datasets are required to evaluate the
performance of the algorithm introduced in this paper. For the open-class
problem, we proposed a new way of assigning a test instance to an “unknown” category.
This new algorithm worked successfully on all three datasets. However, additional
empirical analyses are needed to explore different parameter values for this
setting.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we described the approach to the authorship attribution task that we adopted for
the PAN 2012 competition. We introduced the notion of generating sub-profiles that
represent the author’s writeprint along specific linguistic dimensions. Although the idea
of linguistic modalities has been explored in previous work, it was used there in a machine
learning setting, whereas here we adopt that framework and combine it with a
profile-based approach. The evaluation results on different datasets show that this is a competitive
AA framework for closed-class as well as open-class AA problems. For each dataset
in the closed-class AA problem, our system matched the highest accuracy reported in
the competition. This illustrates that the proposed algorithm for closed-class
problems is very competitive and consistent.</p>
      <p>In the open-class AA setting, our system did not reach the same high
performance, but it was still very competitive in the challenge. For one of the three
open-class datasets, we obtained the best accuracy of the PAN 2012 challenge. The
performance of the system on the dataset with 14 authors was not impressive, which
suggests that we need to keep investigating better ways to
handle open-class settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was partially supported by ONR grant N00014-12-1-0217. The authors
would like to thank the organizers of the PAN 2012 competition for putting together
this event.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dietterich</surname>
          </string-name>
          , T.G.:
          <article-title>Ensemble methods in machine learning</article-title>
          .
          <source>In: International Workshop on Multiple Classifier Systems</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer-Verlag (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Local histograms of character n-grams for authorship attribution</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>288</fpage>
          -
          <lpage>298</lpage>
          .
          <publisher-name>Association for Computational Linguistics (ACL)</publisher-name> (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Frantzeskou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gritzalis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaski</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method</article-title>
          .
          <source>Journal of Digital Evidence</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Frantzeskou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gritzalis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaski</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howald</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          :
          <article-title>Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method</article-title>
          .
          <source>IJDE</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution: A principal component and linear discriminant analysis of the consistent programmer hypothesis</article-title>
          .
          <source>I. J. Comput. Appl</source>
          . pp.
          <fpage>79</fpage>
          -
          <lpage>99</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Houvardas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>N-gram feature selection for author identification</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Artificial Intelligence: Methodology, Systems, and Applications. LNCS</source>
          , vol.
          <volume>4183</volume>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          . Springer, Varna, Bulgaria (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>N-gram based author profiles for authorship attribution</article-title>
          .
          <source>In: Proceedings of the Pacific Association for Computational Linguistics</source>
          . pp.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>N-gram based author profiles for authorship attribution</article-title>
          .
          <source>In: Proceedings of the Pacific Association for Computational Linguistics</source>
          . pp.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution in the wild</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          ,
          <fpage>83</fpage>
          -
          <lpage>94</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lambers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenman</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Forensic authorship attribution using compression distances to prototypes</article-title>
          . In:
          <string-name>
            <surname>Geradts</surname>
            ,
            <given-names>Z.J.M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franke</surname>
            ,
            <given-names>K.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenman</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          (eds.)
          <source>IWCF 2009. LNCS</source>
          , vol.
          <volume>5718</volume>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Luyckx</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Shallow text analysis and machine learning for authorship attribution</article-title>
          .
          <source>In: Proceedings of the Fifteenth Meeting of Computational Linguistics in the Netherlands (CLIN)</source>
          . pp.
          <fpage>149</fpage>
          -
          <lpage>160</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Marneffe</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacCartney</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Generating typed dependency parses from phrase structure parses</article-title>
          .
          <source>In: LREC 2006</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mosteller</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Inference and Disputed Authorship: The Federalist</article-title>
          .
          <publisher-name>Addison-Wesley</publisher-name>
          (
          <year>1964</year>
          ), http://hdl.handle.net/2027/mdp.39015004063254
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuurmans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Language independent authorship attribution using character level language models</article-title>
          .
          <source>In: Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          . Budapest, Hungary (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>24</volume>
          (
          <issue>5</issue>
          ),
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pillay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Generating metafeatures for authorship attribution on web forum posts</article-title>
          .
          <source>In: 5th International Joint Conference on Natural Language Processing, IJCNLP 2011</source>
          . pp.
          <fpage>156</fpage>
          -
          <lpage>164</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Plagiarism detection using stopword n-grams</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          (
          <year>2011</year>
          ), http://dx.doi.org/10.1002/asi.21630
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey on modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>de Vel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corney</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohay</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Multi-topic e-mail authorship attribution forensics</article-title>
          .
          <source>In: Proceedings of the Workshop on Data Mining for Security Applications, 8th ACM Conference on Computer Security</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques</article-title>
          .
          <publisher-name>Morgan Kaufmann</publisher-name>
          , 2nd edn. (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Effective and scalable authorship attribution using function words</article-title>
          .
          <source>In: Proceedings of 2nd Asian Information Retrieval Symposium. LNCS</source>
          , vol.
          <volume>3689</volume>
          , pp.
          <fpage>174</fpage>
          -
          <lpage>189</lpage>
          . Jeju Island, Korea
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>