Local n-grams for Author Identification
Notebook for PAN at CLEF 2013

Robert Layton, Paul Watters, and Richard Dazeley
Internet Commerce Security Laboratory, University of Ballarat
r.layton@icsl.com.au, p.watters@ballarat.edu.au, r.dazeley@ballarat.edu.au

Abstract. Our approach to the author identification task uses existing authorship attribution methods based on local n-grams (LNG) and combines them in a weighted ensemble. This approach came third in this year's competition, using a relatively simple scheme of weighting models by their training set accuracy. LNG models create profiles consisting of a list of the character n-grams that best represent a particular author's writing. The use of a weighted ensemble improved the accuracy of the method without reducing its speed; the submitted solution was not only near the top of the leaderboard in terms of accuracy, but was also one of the faster algorithms submitted.

The authorship identification task at PAN 2013 was a variation on the standard authorship analysis task of authorship attribution. In authorship identification, we have a training set of documents from the same author and a test document of unknown authorship. The task is to determine whether the author of the training documents also wrote the test document. This task differs from authorship attribution in a few ways. First, we cannot simply take a 'best guess' by finding the best matching author. A decision of match or no match must be made, similar to the open set problem of authorship attribution, where the actual author may not be in the candidate set. Second, we have no point of reference against which to compare the similarity of author to document. In other words, we cannot judge relatively whether two profiles are similar, and must therefore find algorithms that can decide absolutely whether two profiles match. Third, specifically for this task, the number of documents was small. Most problems in this task had just three documents from the same author, reducing the ability to estimate variance.

1 Local n-grams

Local n-gram models were first proposed by [9] in 2003, who introduced the Common n-grams algorithm (CNG). CNG was developed for authorship attribution, and works by profiling authors and documents based on the list of the top L character n-grams. Character n-grams had been used in authorship studies previously, but earlier methods used a 'global' set of n-grams. CNG instead uses 'local' profiles, whereby an author's (or document's) profile consists only of the most frequent character n-grams for that author (or document). Authorship attribution for a test document is performed by comparing the profile of the document with the profiles of each of the candidate authors using Equation 1, where P_i(x) is the frequency of n-gram x in a given profile. The most similar profile is considered the author of the test document.

d_K(P_1, P_2) = \sum_{x \in X_{P_1} \cup X_{P_2}} \left( \frac{2 \cdot (P_1(x) - P_2(x))}{P_1(x) + P_2(x)} \right)^2    (1)

[5] identified that a much simpler algorithm, called Source Code Author Profiling (SCAP), is capable of comparable results. Rather than using the frequencies of the most frequent n-grams, two profiles are compared simply by the size of the intersection of the n-grams in the profiles. While this does reduce the accuracy, the reduction is often quite small and the algorithm is significantly faster.
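To make the profile-based comparison concrete, the following is a minimal Python sketch (not the authors' code) of a local n-gram profile, the CNG dissimilarity of Equation 1, and the SCAP intersection measure. The use of relative frequencies and the default values of n and L follow the descriptions above; the function names are ours.

```python
from collections import Counter

def profile(text, n=3, L=1000):
    """Local n-gram profile: the L most frequent character n-grams of a text,
    stored with their relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(L)}

def cng_distance(p1, p2):
    """Equation 1: squared relative differences, summed over the union of the
    two profiles. An n-gram missing from a profile has frequency 0."""
    return sum(
        (2.0 * (p1.get(x, 0.0) - p2.get(x, 0.0)) /
         (p1.get(x, 0.0) + p2.get(x, 0.0))) ** 2
        for x in set(p1) | set(p2)
    )

def scap_similarity(p1, p2):
    """SCAP: profiles are compared by the size of their intersection alone."""
    return len(set(p1) & set(p2))
```

For attribution, the candidate whose profile yields the smallest d_K (or the largest SCAP intersection) against the test document's profile would be chosen as the author.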
As an empirical result, our code for CNG performs a profile comparison in about three milliseconds, while SCAP takes less than one millisecond (our testing framework does not give better precision).

Further variations were developed by [14], who focused on variations of CNG for unbalanced datasets. The first variation, d1, has a single subtle change, using only the n-grams in the test document for the comparison. The second variation, d2, proposed in [14], uses a training corpus profile, called P_C, which is composed of all the documents in the training set. The distance is given as the dot product of the per-n-gram d1 terms between the document and the author profile and the corresponding terms between the document and the language profile. These alterations to CNG proved more accurate than 'standard' CNG when there was a limited number of documents per author. However, when an adequate number of documents per author is available, the results reverse and standard CNG, using Equation 1, proves to be more accurate.

Another variation was developed by [13], named Recentred Local Profiles (RLP). While the distance metric differs from CNG and SCAP, using a variation of the cosine distance, the main difference is the use of a language default value, similar to [14]. Rather than profiling using the most frequent n-grams for an author or document, the most distinctive n-grams are chosen. An n-gram is more distinctive than another if the absolute distance between its frequency and the expected value for the language is higher. The expected value for an n-gram can be calculated from a corpus of text in the language, which has the benefit of being able to use unlabelled data. RLP was shown to perform well on many datasets, including the AAAC corpus [7].

LNG methods have been used for a significant number of tasks. Some successful applications include authorship of Twitter posts [11], student coding assignments [2,3], software forensics [6], Internet Relay Chat logs [10] and malware profiling [1], and the approach has been extended into automated unsupervised applications [12]. LNG-based approaches are applicable to almost any language, as the notion of a character is close to universal. In cases where a character can be hard to define, examining the bytes of a representation (such as Unicode) can often provide an adequate substitute [4].

2 Methodology

The LNG methods described in the literature were mostly developed for authorship attribution rather than authorship identification. Intuitively, we created an algorithm that identifies whether the test document compares to the training set in the same way as documents in the training set compare to each other - i.e. if it looks like a training document, it was probably written by the same person. To convert the methods to an authorship identification task, we applied the following procedure (a code sketch of the procedure is given below).

1. Calculate the pairwise distance between all documents in the training set and call this distribution d_u.
2. Calculate the distance between the test document and each of the training documents, and call this distribution d_t.
3. Compute the number of elements in d_t that are less than the mean of d_u and call this p.
4. If p < t for some threshold t (set to 0.5 by default), the test document is considered a match. Otherwise, it is not a match.

Formally, a 'problem' is defined as a set of documents from the same author (the training set) and a document of unknown authorship (the test document). The aim is to determine if the author of the training set wrote the test document.
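The sketch below illustrates the four steps for a single problem, reusing the profile and cng_distance helpers from the earlier sketch (any of the LNG distances could be substituted). Two caveats: we read p as a proportion of d_t, since the default threshold of 0.5 suggests a fraction rather than a raw count, and the match condition follows step 4 exactly as stated above; both the reading and the names are our own.

```python
import statistics
from itertools import combinations

def identify(train_texts, test_text, distance=cng_distance, n=3, L=1000, t=0.5):
    """Steps 1-4 above for a single problem; assumes at least two training
    documents (most PAN 2013 problems had three)."""
    train_profiles = [profile(doc, n, L) for doc in train_texts]
    test_profile = profile(test_text, n, L)

    # Step 1: pairwise distances within the training set (d_u).
    d_u = [distance(a, b) for a, b in combinations(train_profiles, 2)]

    # Step 2: distances from the test document to each training document (d_t).
    d_t = [distance(test_profile, p) for p in train_profiles]

    # Step 3: p, read here as the proportion of d_t below the mean of d_u.
    mean_du = statistics.mean(d_u)
    p = sum(1 for d in d_t if d < mean_du) / len(d_t)

    # Step 4: decision, using the threshold condition as stated in the text.
    return p < t
```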
The ensemble method was inspired by [8], which used a weighted voting model with local n-grams, including a 'reject' option indicating that no match was found. In that work, a large number of CNG models were used with different parameters, and the weights were assigned based on the ratio r = 1 - a/b, where a is the distance between the test profile and the best matching author profile and b is the distance to the second best. Note that this weighting is valid for authorship attribution tasks, but not for authorship identification, as we only have one candidate author.

Instead, the method we used was to assign each model a weight according to its accuracy on the training set. Any model with a weight of less than 0.4 is reduced to 0. The reason for this is that in the author identification task there are only two possible answers - match or no match. This gives a baseline accuracy of 0.5, assuming the test dataset contains equal numbers of problems of each type. We decided that models scoring under 0.4 were a hindrance rather than a help.

To summarise, our approach was to use a set of LNG models and ask each whether it thought the document was from the same author or not, using the procedure given in Section 2. Each model's prediction was weighted according to its accuracy on the training set, with models scoring less than 0.4 given a weight of 0. The weights for both options (match or no match) were summed and the option with the highest total weight was chosen as the final answer (a sketch of this voting scheme is given after Table 1). The models included in the ensemble were the CNG, SCAP, RLP, d1 and d2 models. The parameters were n ∈ {3, 4, 5} and L = 1000 for each of the five models. We hypothesise that a larger number of models would have improved accuracy, but we were not able to test this hypothesis for this competition.

3 Results

The competition organisers released a set of problems for training, which we used in a cross-fold methodology to evaluate the algorithms. The results for each fold are given in Table 1, and the overall accuracy was 71.25%. In the competition, the algorithm achieved an F1 score of 0.671, which placed it third. In addition, the algorithm was the fourth fastest in the competition.

Fold        Mean     1      2      3      4      5      6      7      8      9      10
Accuracy    0.71250  1.000  1.000  0.333  1.000  0.500  0.333  1.000  0.333  1.000  0.625

Table 1. Results from each fold using the weighted ensemble.
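The voting scheme described in Section 2 reduces to a few lines. The following is an illustrative sketch rather than the submitted implementation, with function and parameter names of our choosing; tie-breaking is not discussed above, so this sketch defaults to 'no match'.

```python
def ensemble_decision(votes, accuracies, cutoff=0.4):
    """Weighted vote over the LNG models' answers.

    votes      -- one boolean per model (True = 'same author')
    accuracies -- each model's training-set accuracy, used as its weight
    cutoff     -- weights below this value are reduced to zero
    """
    weights = [acc if acc >= cutoff else 0.0 for acc in accuracies]
    match_weight = sum(w for vote, w in zip(votes, weights) if vote)
    no_match_weight = sum(w for vote, w in zip(votes, weights) if not vote)
    # Ties fall back to 'no match' in this sketch.
    return match_weight > no_match_weight
```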
References

1. Alazab, M., Layton, R., Venkataraman, S., Watters, P.: Malware detection based on structural and behavioural features of API calls (2010)
2. Burrows, S.: Source Code Authorship Attribution. Ph.D. thesis, RMIT University (2010)
3. Burrows, S., Uitdenbogerd, A.L., Turpin, A.: Application of information retrieval techniques for source code authorship attribution. In: Database Systems for Advanced Applications, pp. 699–713. Springer (2009)
4. Frantzeskou, G., Gritzalis, S., MacDonell, S.G.: Source code authorship analysis for supporting the cybercrime investigation process. In: Proceedings of the First International Conference on E-business and Telecommunication Networks (ICETE 2004), pp. 85–92 (2004)
5. Frantzeskou, G., Stamatatos, E., Gritzalis, S., Chaski, C.E.: Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method. International Journal of Digital Evidence 6 (2007)
6. Frantzeskou, G., Stamatatos, E., Gritzalis, S.: Supporting the cybercrime investigation process: effective discrimination of source code authors based on byte-level information. In: E-business and Telecommunication Networks, pp. 163–173. Springer (2007)
7. Juola, P.: Ad-hoc Authorship Attribution Competition. In: Proceedings of the 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH 2004) (2004)
8. Kešelj, V., Cercone, N.: CNG method with weighted voting. In: Juola, P.: Ad-hoc Authorship Attribution Competition (ALLC/ACH 2004), Göteborg, Sweden (2004)
9. Kešelj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: Proceedings of the Pacific Association for Computational Linguistics (2003)
10. Layton, R., McCombie, S., Watters, P.: Authorship attribution of IRC messages using inverse author frequency. In: 2012 Third Cybercrime and Trustworthy Computing Workshop (CTC), pp. 7–13. IEEE (2012)
11. Layton, R., Watters, P., Dazeley, R.: Authorship attribution for Twitter in 140 characters or less. In: 2010 Second Cybercrime and Trustworthy Computing Workshop, pp. 1–8. IEEE (2010)
12. Layton, R., Watters, P., Dazeley, R.: Automated unsupervised authorship analysis using evidence accumulation clustering. Natural Language Engineering, available on CJO (2011)
13. Layton, R., Watters, P., Dazeley, R.: Recentred Local Profiles for Authorship Attribution. Journal of Natural Language Engineering (2011), available on CJO 2011
14. Stamatatos, E.: Author identification using imbalanced and limited training texts. In: 18th International Workshop on Database and Expert Systems Applications (DEXA 2007), pp. 237–241. IEEE (2007)