Authorship identification using correlations of frequent features
Notebook for PAN at CLEF 2013

Timo Petmanson
Institute of Computer Science, University of Tartu
timo_p@ut.ee

Abstract. In this work, we explore how well we can perform the PAN'13 authorship identification task by using correlations of frequent, statistically significant features found in documents written by a known author. We extract features from the context of four types of words: first words of sentences or lines, nouns, verbs, and punctuation. We compute the Matthews correlation coefficient for all pairs of extracted features and, using principal component analysis, transform them into a form suitable for using simple Euclidean distance as a metric. By hypothesizing that the distances between different documents written by the same author belong to the same distribution, we can provide educated guesses for authorship identification. Our system achieves an F1-score of 66.7% for English, 56.7% for Greek and 80.0% for Spanish texts, with an overall F1-score of 67.1% on the PAN'13 dataset.

Introduction

A single subtask of the PAN'13 authorship identification task contains up to ten documents written by a known author and a single document that needs to be classified as either written by the same author or not. The dataset contains documents in English, Greek and Spanish. A similar task was also presented at PAN'11 [2], but with a fixed set of authors.

Our approach is to first extract all frequent and statistically significant features from all documents given in a single subtask. Then, we use the correlations of these features to compute the distances between all given documents. We use Student's t-test to determine whether the distances come from the same distribution, and hence from the same author. On the preliminary data, our approach achieved an F1-score of 66.7% for English, 56.7% for Greek and 80.0% for Spanish texts, with an overall F1-score of 67.1%.

1 Feature extraction

Given a single subtask, we start by extracting lexical and morphological features. Lexical features are the original word, whether the word starts with an uppercase character, whether all of its characters are uppercase, whether it contains punctuation, whether it contains digits, and whether it is the first or last word of a sentence or a line. Morphological features are the word lemma, the part-of-speech tag, and other language-specific features such as case, mood, etc. Similar features have traditionally been used in authorship identification tasks [8,2]. For feature extraction, we used the NLTK toolkit for POS-tagging English [3], TreeTagger for Spanish [7] and the AUEB tagger for Greek [4].

We encoded the features of each word as lists of tuples (feature, offset, value), where the offset specifies the position of the feature/value combination relative to the current word. For instance, the feature (case, -1, nominative) says that the case of the previous word is nominative. We enumerated all possible features with offsets ranging from -2 to 2. Additionally, we were interested in extracting composite features that express the co-occurrence of two or more simple features. For example, a composite feature could be (pos-tag, 1, noun) & (case, 0, partitive).

Frequent composite features. We say a feature is frequent if its support (number of occurrences) is greater than or equal to a predefined threshold. We set the threshold so that every feature matching at least 5% of tokens is frequent.
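To make the encoding concrete, the following is a minimal Python sketch of the (feature, offset, value) representation and of the 5% support threshold. The dictionary-based token representation and all function names are illustrative assumptions, not the actual implementation, which derives token attributes from the NLTK, TreeTagger and AUEB tagger output.

```python
# Sketch of the (feature, offset, value) encoding and the 5% support threshold.
# The dict-based token representation and all names here are illustrative only.
from collections import Counter

def context_features(tokens, position, window=2):
    """Encode the context of one token as a set of (feature, offset, value) tuples."""
    feats = set()
    for offset in range(-window, window + 1):
        i = position + offset
        if 0 <= i < len(tokens):
            # e.g. tokens[i] == {'word': 'dog', 'pos': 'noun', 'case': 'nominative'}
            for feature, value in tokens[i].items():
                feats.add((feature, offset, value))
    return feats

def frequent_features(tokens, min_fraction=0.05):
    """Keep simple features whose support is at least 5% of the tokens."""
    support = Counter()
    for position in range(len(tokens)):
        support.update(context_features(tokens, position))
    threshold = min_fraction * len(tokens)
    return {f for f, count in support.items() if count >= threshold}

# A composite feature is a conjunction of simple features, for example
# (('pos-tag', 1, 'noun'), ('case', 0, 'partitive')); as explained below, only
# conjunctions of frequent simple features need to be considered as candidates.
```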
In particular, we are interested in extracting frequent features for four different types of tokens: first words of sentences or lines, nouns and adjectives, verbs, and punctuation. The thresholds were set separately for each group, relative to the number of tokens in that group.

Counting the support of simple features is straightforward. To obtain frequent composite features, we use the monotonicity property of support. Given two features A and B,

    support(A & B) ≤ min(support(A), support(B)),                                (1)

which means that if either of the features A and B is not frequent, then their conjunction A & B is not frequent either. The monotonicity property makes it possible to extract all frequent composite features efficiently using the Apriori algorithm from frequent itemset mining [1]: basic and composite features are encoded as itemsets and tokens as transactions.

Statistical significance. Not all frequent features are relevant to authorship identification, because of common frequent words and language-specific grammatical and stylistic patterns. For this purpose, we compiled random subsets of documents from the Brown corpus [5] and from the Spanish and Greek Wikipedias; each subset contained 100 documents. For each frequent feature, we first compute the percentage of matched tokens in every document of the reference corpus for the subtask language. Then, we estimate the p-value as the fraction of reference documents whose percentage of matches is higher than or equal to the one obtained on the subtask documents. Finally, we filter out the features that are not statistically significant (p-value above 0.025). This helps to reduce the number of irrelevant features.

2 Feature correlation as a distance metric

We use the Matthews correlation coefficient (MCC) [6] to measure the similarity between two different features in a single document, since we represent each feature as a bit vector of matched positions and MCC has proved to be more stable on binary vectors. We compute the MCC between all pairs of features of a document and compile a vector of length m = k(k-1)/2, where k is the number of features. Given n documents, we compile an n × m matrix to store the correlation vectors; each row represents the feature correlations of a particular document. Since we extracted features for four different groups of words (first words of a sentence or a line, verbs, nouns, punctuation), we compile four such matrices for each subtask and, by concatenating the rows that correspond to the same document, combine them into a single matrix M containing all the pair-wise feature correlations for every group of words.

Next, we find n - 1 principal components of the matrix M and project the data into a new matrix P. As a result, we can view each document as a point in (n-1)-dimensional Euclidean space and use the Euclidean distance to compute the distance between any two documents. As an alternative, we also measured the distance between documents by computing the mean squared error (MSE) between their respective correlation vectors, but MSE did not prove to be as discriminating as the PCA-based approach.
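A minimal sketch of this distance computation is given below, assuming the binary feature match vectors are available as NumPy arrays. The use of scikit-learn's PCA and SciPy's distance helpers, as well as all function names, are illustrative choices rather than the original implementation.

```python
# Sketch: pairwise MCC between feature match vectors, PCA projection and
# Euclidean distances between documents. Library choices and names are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist, squareform

def matthews_corr(a, b):
    """MCC between two binary vectors of matched token positions."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    tp = np.sum(a & b); tn = np.sum(~a & ~b)
    fp = np.sum(~a & b); fn = np.sum(a & ~b)
    denom = np.sqrt(float(tp + fp) * float(tp + fn) * float(tn + fp) * float(tn + fn))
    return 0.0 if denom == 0 else float(tp * tn - fp * fn) / denom

def correlation_vector(match_vectors):
    """MCC for all pairs of a document's k features: a vector of length k(k-1)/2."""
    k = len(match_vectors)
    return np.array([matthews_corr(match_vectors[i], match_vectors[j])
                     for i in range(k) for j in range(i + 1, k)])

def document_distances(M):
    """Project the n x m correlation matrix M onto its n-1 principal components
    and return the n x n matrix of pairwise Euclidean document distances."""
    n = M.shape[0]
    projected = PCA(n_components=n - 1).fit_transform(M)
    return squareform(pdist(projected, metric='euclidean'))
```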
We hypothesize that if the unknown document was not written by the same author, its mean distance to the known documents should be significantly larger than the mean distance between the known documents. If the mean distance between the unknown and the known documents is less than or equal to the mean distance between the known documents, we immediately report the unknown document as written by the same author. Otherwise, we use a one-tailed Student's t-test to obtain the p-value describing the statistical significance of the difference between the two means. We use a rather large threshold of 0.5 to determine the authorship:

    result = Y (same author)        if p-value ≥ 0.5,
             N (different author)   if p-value < 0.5.

Some subtasks of the PAN'13 dataset contained only one known document from the author. In such cases, we split the known document into two and compared the first and second parts, as our approach requires at least two known documents from an author.

3 Evaluation

Figure 1 depicts the distances produced by our method for all three languages on the training subset of the data that was available while writing this notebook. We see that our distance metric is quite discriminating for English and Spanish, as documents written by the same author are on average more similar than documents written by someone else. Our approach has the most difficulty with Greek, where the distances do not reflect the authorship of the documents very well.

[Figure 1. Feature-correlation-based distances (known vs known and known vs unknown documents) for English, Greek and Spanish in the training subset of the PAN'13 authorship identification dataset.]

On the full evaluation dataset, our approach achieved the following F1-scores: 66.7% for English, 56.7% for Greek and 80.0% for Spanish, with an overall F1-score of 67.1%. On the Spanish subset, our approach was ranked second. Overall, our solution shared the fourth and fifth rank out of a total of eighteen participants.

Summary

In this work, we have shown that the features we extract carry a rather strong signal relevant for authorship identification; at least our t-test based approach handles certain cases correctly. However, we might improve the results by using more sophisticated novelty detection algorithms instead. We might also obtain stronger features by performing PCA separately on each of the four token groups. Evaluating these options is the next important step in our future work.

References

1. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases (VLDB). vol. 1215, pp. 487–499 (1994)
2. Argamon, S., Juola, P.: Overview of the international authorship identification competition at PAN-2011. In: CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers), Amsterdam, The Netherlands (2011)
3. Bird, S.: NLTK: the Natural Language Toolkit. In: Proceedings of the COLING/ACL Interactive Presentation Sessions. pp. 69–72. Association for Computational Linguistics (2006)
4. Koleli, E.: A new Greek part-of-speech tagger, based on a maximum entropy classifier. Master's thesis, Department of Informatics, Athens University of Economics and Business (2011)
5. Kučera, H., Francis, W.N.: Computational Analysis of Present-Day American English. Dartmouth Publishing Group (1967)
6. Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2), 442–451 (1975)
7. Schmid, H.: TreeTagger. TC project at the Institute for Computational Linguistics of the University of Stuttgart (1994)
8. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)