<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A simple Local n-gram Ensemble for Authorship Verification</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Internet Commerce Security Laboratory, Federation University Australia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>1073</fpage>
      <lpage>1078</lpage>
      <abstract>
        <p>The authorship verification task requires deciding whether a given test document was written by the same author as a given training set of documents. For our attempt we tested a simple voting ensemble of local (character) n-gram methods, using a grid search to choose parameters. This results in a method that requires little preconfiguration and can be applied to any language with a concept of characters. The method itself is quite fast; however, training is slow due to the large number of attempted parameter combinations. The approach results in accuracies of around 60%, depending on the corpus and application.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Local n-gram (LNG) methods employ a profile based character n-gram approach to
authorship analysis. An author’s profile consists of the top L n-grams for a given author,
unlike many feature selection methods, which usually use a globally relevant set
of features. The author profile is then compared to a similarly created document
profile using a given distance metric. For authorship attribution, a classification task, each
candidate author is first profiled. A test document of unknown authorship is then
profiled and the distance to each candidate author profile is computed. The author with the
lowest distance is predicted as the author.</p>
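      <p>The attribution scheme just described reduces to a nearest-profile decision. A minimal sketch, with illustrative names (the profile construction and distance metric are supplied by whichever LNG method is in use):</p>

```python
def attribute(author_profiles, doc_profile, distance):
    """Predict the candidate author whose profile is nearest to the
    profile of the unknown document, as described above.

    author_profiles: dict mapping author name -> profile
    distance: callable comparing two profiles (lower = more similar)
    """
    return min(author_profiles,
               key=lambda a: distance(author_profiles[a], doc_profile))
```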
      <p>Applying LNG to authorship verification makes use of the concept of distance, but
applies it in a different way. First we formally define the problem.</p>
      <p>We are given a training set of documents D (usually such that 1 &lt;= |D| &lt;= 5) all
authored by the same person A. Next, we are given a test document dt. The task is to
determine whether the author of dt is A, i.e. the author of the documents in D. We refer
to the above task as a single trial, with the authorship verification task composing of a
large number of trials in different languages and different contexts.</p>
      <p>We employed a straightforward translation of the use of LNG for classification
purposes to authorship verification purposes. We calculated the distance between the test
document dt and each of the documents in the training set D, called the inter-distance,
and compared that to the internal distance between documents in D to themselves, the
intra-distance. The assumption was that if the inter-distance was approximately equal
to the intra-distance, then the document was likely to be from the same author. If they
were not approximately equal, then it is more likely a different author wrote dt.
      </p>
      <p>Our results were quite poor for this year’s competition. The focus on automation of
the algorithm may have hurt performance, due to a lack of tuning. A more fine-tuned
approach would be recommended, rather than a generic ‘catch-all’ approach.</p>
    </sec>
    <sec id="sec-2">
      <title>Datasets</title>
      <p>
        There were six datasets in the training corpus, and a total of 596 individual trials. There
were four languages represented in the released datasets: Dutch, English, Greek and
Spanish. Compositions of the datasets are provided in table 1, and all datasets had
approximately equal numbers of positive and negative trials.
In recent years, the authorship analysis field has used machine learning techniques for
the majority of its research [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In this paper, we too focus on such techniques. Algorithms
in authorship analysis can be placed into two categories: global and local methods.
Global methods fit a more standard feature based machine learning methodology. In
this methodology, a set of features is used to take measurements of each of a set of
documents. This gives us a matrix X such that X(i, j) is the value of feature j of document
i. This model can be used as input into a large number of classification or clustering
algorithms, such as Support Vector Machines or the k-means algorithm.
      </p>
      <p>
        Advances in local algorithms have shown great success in this alternate form of
model. A document, or set of documents, is represented as a profile P such that P (x)
is the value of feature x for the document, or set of documents. The features chosen
are usually character n-grams: subsequences of contiguous characters of length n. The
value of P (x) is then given as the frequency of n-gram x in the document, or set of
documents, being profiled. When local models are used with character n-grams, the
approach is called Local n-grams (LNG). What makes a local model particularly different
from a global model is that there is no global set of features. By profiling the set of all
documents known to be from one author, we can profile that author’s writings. While
most applications of LNG have been supervised, LNG methods have also been used for
unsupervised methodologies, outperforming feature based models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>In a local model, by contrast, each document (or set) is profiled using the set of features most distinctive
to that document (or set). This means each profile has a separate set of features
associated with it. The phrase most distinctive usually means ‘most frequent’, however the
RLP algorithm introduced later has a different definition.</p>
      <p>As notation for the following, we state that a feature x is ‘in’ a profile P if P(x) ≠ 0.
The intersection of two profiles is the set of features that are in both, and the union is
the set of features in either (ignoring values).</p>
      <p>
        The first model of this type was the Common n-grams (CNG) method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A profile
is given as the set of the L most frequent n-grams, for some value of L. Profiles are
then compared using equation 1. A document of unknown authorship is attributed to
the author with the most similar profile.
      </p>
      <p>K(P1, P2) = Σ_{x ∈ X_{P1} ∪ X_{P2}} ( 2 (P1(x) − P2(x)) / (P1(x) + P2(x)) )²   (1)</p>
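      <p>As a sketch, the CNG profiling step and equation 1 might be implemented as follows; the function names and the relative-frequency normalisation are our illustrative choices, not part of any published implementation:</p>

```python
from collections import Counter

def profile(text, n=3, L=1000):
    """Top-L character n-gram profile: maps each of the L most frequent
    n-grams in `text` to its relative frequency."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(L)}

def cng_dissimilarity(p1, p2):
    """Equation 1: sum over the union of the two profiles' n-grams.
    Identical profiles score 0; fully disjoint profiles score 4 per n-gram."""
    return sum(
        (2 * (p1.get(x, 0.0) - p2.get(x, 0.0)) /
         (p1.get(x, 0.0) + p2.get(x, 0.0))) ** 2
        for x in set(p1) | set(p2)
    )
```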
      <p>
        A variant of this form, the Source Code Author Profile (SCAP) algorithm, was
introduced by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This algorithm is a variant of CNG, with only one change. Rather
than using equation 1 to compare profiles, the similarity of two profiles is given as the
size of the set intersection of them. The higher the number of features in the intersection,
the more similar they are. This is bounded by the choice of L as an input, and therefore
can be normalised by dividing by L. The major finding by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was that this approach
approximated the results of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] using a much simpler algorithm. SCAP can be very fast
to run on modern systems and provides a good approximation to CNG, allowing its
use in prototyping [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. CNG-WPI (Weighted Profile Intersection) is an improvement
to SCAP which weights the n-grams based on the number of documents they appear in
(inferring the likelihood of the n-gram to appear in both profiles) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
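      <p>The SCAP comparison reduces to a set intersection; a minimal sketch, with the normalisation by L following the description above:</p>

```python
def scap_similarity(p1, p2, L):
    """SCAP similarity: size of the profile intersection, normalised by the
    profile length L (higher = more similar)."""
    return len(set(p1) & set(p2)) / L
```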
      <p>
        Stamatatos’ d1 and d2 measures are improvements designed to work with
unbalanced datasets [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. These measures weight the profile similarity comparison using
a profile of language default values. This approach was found to be more effective than
CNG for imbalanced datasets, while less effective for balanced datasets.
      </p>
      <p>
        The Recentred Local Profiles (RLP) algorithm was developed by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], again
using this concept of a language default profile. The derivation of the CNG methodology
used work by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which originally included a concept of a language default value; the
expected frequency of a particular n-gram in normal use of the language. This
language default was removed in a simplification of the algorithm which led to the CNG
methodology. RLP reinstated this component, adjusting profile weights based on
this language default value, such that PD(x) = PDC(x) − PL(x), where PDC is the LNG
profile of the document and PL is the CNG profile of all documents in that language. A
profile then consisted of the L most distinctive n-grams, i.e. those with the highest
absolute weights. Documents are then compared using a variant of the cosine distance
metric, given in equation 2.
      </p>
      <p>R(P1, P2) = 1 − (P1 · P2) / (||P1||₂ ||P2||₂)   (2)</p>
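      <p>A minimal sketch of the recentring step and equation 2, with illustrative function names; profiles are assumed to be dicts mapping n-grams to weights:</p>

```python
import math

def recentre(doc_profile, lang_profile):
    """RLP weights: document frequency minus the language default,
    PD(x) = PDC(x) - PL(x), over the union of both profiles' n-grams."""
    grams = set(doc_profile) | set(lang_profile)
    return {x: doc_profile.get(x, 0.0) - lang_profile.get(x, 0.0)
            for x in grams}

def rlp_distance(p1, p2):
    """Equation 2: one minus the cosine similarity of the two profiles."""
    dot = sum(v * p2.get(x, 0.0) for x, v in p1.items())
    n1 = math.sqrt(sum(v * v for v in p1.values()))
    n2 = math.sqrt(sum(v * v for v in p2.values()))
    return 1.0 - dot / (n1 * n2)
```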
      <p>
        LNG methods have shown a high accuracy in difficult domains [
        <xref ref-type="bibr" rid="ref14 ref4 ref8 ref9">9,8,14,4</xref>
        ]. In
addition, little recoding is needed to apply them in multiple languages. Almost every
language has a definition of ‘character’, and for those that do not, LNG methods can be
applied to byte level n-grams, rather than characters [
        <xref ref-type="bibr" rid="ref3 ref6">6,3</xref>
        ]. These methods are
remarkably robust to different languages, and can be applied to source code languages, as well
as natural languages [
        <xref ref-type="bibr" rid="ref12 ref2">2,12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>LNG Methods Employed</title>
      <p>For this application, we used the CNG, SCAP and RLP methods as base-level
classifiers. These methods were chosen as a representative sample of LNG methods, not
specifically for their relative efficacy, as other LNG methods achieve similar
accuracies. The n values chosen for the task were 3 to 5 inclusive, while the L values
were 1000, 2000, 5000 and 10000. This set of parameters was chosen to be small enough to
compute in a reasonable time, while still being relatively representative of feature values
proven effective in other studies.</p>
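      <p>The decisions of the base methods are combined by the simple voting mentioned in the abstract; a minimal sketch, assuming each base method emits a boolean verdict per trial:</p>

```python
def majority_vote(predictions):
    """Combine base-method verdicts by simple majority.
    `predictions` is a list of booleans, one per base method."""
    return sum(predictions) > len(predictions) / 2
```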
      <p>For each of the base methods (CNG, SCAP and RLP), a grid search of parameters
was conducted to find the most accurate combination. The grid search compared all
combinations of the n and L values above.</p>
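      <p>The grid search might be sketched as follows; `evaluate` is a hypothetical callback, assumed to return, say, cross-validation accuracy for one (method, n, L) combination:</p>

```python
from itertools import product

def grid_search(methods, evaluate,
                ns=(3, 4, 5), Ls=(1000, 2000, 5000, 10000)):
    """For each base method, score every (n, L) combination with `evaluate`
    and keep the best-scoring pair along with its score."""
    best = {}
    for method in methods:
        scored = [((n, L), evaluate(method, n, L))
                  for n, L in product(ns, Ls)]
        best[method] = max(scored, key=lambda item: item[1])
    return best
```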
    </sec>
    <sec id="sec-4">
      <title>Applying Thresholds for Verification</title>
      <p>We used a distance based threshold for determining whether the given document
belonged to a specific author. This threshold was relative to the documents themselves,
not a global value.</p>
      <p>We calculated the distance between the test document dt and each of the documents
in the training set D, called the inter-distance, and compared that to the internal distance
between documents in D to themselves, the intra-distance. The assumption was that if
the inter-distance was approximately equal to the intra-distance, then the document was
likely to be from the same author. If they were not approximately equal, then it is more
likely a different author wrote dt.</p>
      <p>Because there was only one test document, we could not apply standard statistical
distribution comparisons, and instead opted for a simpler approach. The inter-distance
was considered to be approximately equal to the intra distance if it was less than the
average intra-distance, plus two standard deviations (of the intra-distance for a given
dataset D).</p>
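      <p>A minimal sketch of this threshold rule, assuming at least three training documents so the intra-distances have a measurable spread (function names are illustrative):</p>

```python
from itertools import combinations
from statistics import mean, stdev

def verify(train_profiles, test_profile, distance):
    """Accept the test document if its mean distance to the training
    documents (inter-distance) is at most the mean intra-distance plus
    two standard deviations, as described above."""
    intra = [distance(a, b) for a, b in combinations(train_profiles, 2)]
    inter = mean(distance(p, test_profile) for p in train_profiles)
    threshold = mean(intra) + 2 * stdev(intra)
    return inter <= threshold
```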
      <p>The obtained results were lower than expected, and lower than the results of the
individual base models. This suggests that this threshold method may require extensive
work, and perhaps an alternate strategy.</p>
    </sec>
    <sec id="sec-5">
      <title>Summary of Results</title>
      <p>While there was some variability in the results and rank (with a top rank of 4th on one
corpus), the results were typically quite poor. The baseline for the results is
approximately 0.5, which was beaten on all datasets, but only barely in most. The performance
was also less than expected, based on cross-validation results within the dataset. The
reason for this is likely that the small amount of data was not properly accounted for in
the cross-validation model, meaning that the final model overfit the data. Early
investigations into these errors suggest that this is caused by the threshold-based method,
and not the baseline LNG methods. There were some errors that can be traced to the
baseline LNG methods though, suggesting that further testing is necessary for this form
of application.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          :
          <article-title>Scientific and engineering problem-solving with the computer</article-title>
          . Prentice Hall PTR Upper Saddle River, NJ, USA (
          <year>1976</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tahaghoghi</surname>
            ,
            <given-names>S.M.:</given-names>
          </string-name>
          <article-title>Source code authorship attribution using n-grams</article-title>
          .
          <source>In: Proceedings of the Twelfth Australasian Document Computing Symposium</source>
          , Melbourne, Australia, RMIT University. pp.
          <fpage>32</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uitdenbogerd</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turpin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Application of information retrieval techniques for source code authorship attribution</article-title>
          .
          <source>In: Database Systems for Advanced Applications</source>
          . p.
          <source>699–713</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chatzicharalampous</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frantzeskou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>Author identification in imbalanced sets of source code samples</article-title>
          .
          <source>In: Tools with Artificial Intelligence (ICTAI)</source>
          ,
          <source>2012 IEEE 24th International Conference on. vol. 1</source>
          , p.
          <source>790–797</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          , Montes-y
          <string-name>
            <surname>Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A weighted profile intersection measure for profile-based authorship attribution</article-title>
          .
          <source>In: Advances in Artificial Intelligence</source>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>243</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Frantzeskou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gritzalis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaski</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method</article-title>
          .
          <source>Int. Journal of Digital Evidence</source>
          <volume>6</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kešelj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>N-gram-based author profiles for authorship attribution</article-title>
          .
          <source>In: Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING</source>
          . vol.
          <volume>3</volume>
          , p.
          <source>255–264</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCombie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution of irc messages using inverse author frequency</article-title>
          .
          <source>In: Cybercrime and Trustworthy Computing Workshop (CTC)</source>
          ,
          <year>2012</year>
          Third. pp.
          <fpage>7</fpage>
          -
          <lpage>13</lpage>
          . IEEE (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dazeley</surname>
          </string-name>
          , R.:
          <article-title>Authorship attribution for twitter in 140 characters or less</article-title>
          .
          <source>In: Cybercrime and Trustworthy Computing Workshop (CTC)</source>
          ,
          <year>2010</year>
          Second. pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dazeley</surname>
          </string-name>
          , R.:
          <article-title>Automatically determining phishing campaigns using the uscap methodology</article-title>
          .
          <source>In: eCrime Researchers Summit (eCrime)</source>
          ,
          <year>2010</year>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dazeley</surname>
          </string-name>
          , R.:
          <article-title>Recentred local profiles for authorship attribution</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>18</volume>
          (
          <issue>03</issue>
          ),
          <fpage>293</fpage>
          -
          <lpage>312</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dazeley</surname>
          </string-name>
          , R.:
          <article-title>Unsupervised authorship analysis of phishing webpages</article-title>
          .
          <source>In: Communications and Information Technologies (ISCIT)</source>
          ,
          <source>2012 International Symposium on</source>
          . pp.
          <fpage>1104</fpage>
          -
          <lpage>1109</lpage>
          . IEEE (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dazeley</surname>
          </string-name>
          , R.:
          <article-title>Automated unsupervised authorship analysis using evidence accumulation clustering</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>19</volume>
          (
          <issue>01</issue>
          ),
          <fpage>95</fpage>
          -
          <lpage>120</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>Author identification using imbalanced and limited training texts</article-title>
          .
          <source>In: Database and Expert Systems Applications</source>
          ,
          <year>2007</year>
          . DEXA'
          <volume>07</volume>
          . 18th International Workshop on. p.
          <source>237–241</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>