      Authorship Verification with Entity Coherence and
               Other Rich Linguistic Features
                        Notebook for PAN at CLEF 2013

                            Vanessa Wei Feng and Graeme Hirst
                                  University of Toronto
                       weifeng@cs.toronto.edu, gh@cs.toronto.edu


       Abstract We adopt Koppel et al.’s unmasking approach [5] as the major framework of our
       authorship verification system. We enrich Koppel et al.’s original word-frequency features
       with a novel set of coherence features, derived from our earlier work [2], together with a
       full set of stylometric features. For texts written in languages other than English, some
       stylometric features are unavailable because of the lack of appropriate NLP tools, and their
       coherence features are derived from English translations produced by the Google Translate
       service. Evaluated on the training corpus, we achieve an overall accuracy of 65.7%: 100.0%
       for both English and Spanish texts, but only 40% for Greek texts; evaluated on the test
       corpus, we achieve an overall accuracy of 68.2%, with roughly the same performance across
       the three languages.


1     Introduction
Authorship verification, a sub-task of authorship identification, is the problem of determining
whether two documents were written by the same author. Typically, a set of documents known
to be written by the author of interest is given, and an authorship verification system must
determine whether a given unknown document was written by that author.
    We follow the unmasking approach described by Koppel et al. [5], which is de-
signed specifically for the task of authorship verification, as the major framework of
our authorship verification system. However, rather than using word-frequency features
as Koppel et al. did, we unmask a pair of documents by using a set of linguistic fea-
tures, including our own coherence features and well-established stylometric features.
Moreover, because extracting our novel coherence features requires sophisticated coreference
resolution tools that are available only for English, we first translate non-English texts into
English and extract the coherence features from the translations.

2     Methodology
2.1   Unmasking
Unmasking [5] is a technique developed specifically for the task of authorship verifi-
cation. Its underlying idea is that, if two documents were written by the same author,
then any features a classifier finds that (spuriously) discriminate their authorship must
be weak and few in number. On the other hand, if the texts were written by different
authors, then many more features will support their (correct) discrimination.
    Our modified unmasking approach is as follows. From all known documents in the
training corpus, we extract (1) Ssame, the set of pairs of documents written by the same
author, and (2) Sdiff, the set of pairs of documents written by different authors. For each
document pair ⟨di, dj⟩, written by authors Ai and Aj respectively, the two documents,
di and dj, are segmented into equal-sized, non-overlapping small chunks. If the
number of chunks of either document is less than 5, up-sampling is first performed
to pad its size to at least 5. If the two documents yield unequal numbers of chunks,
a balanced sample is obtained by randomly discarding surplus chunks from the
larger set. The following procedure is repeated N times.

 1. A weak classifier with a set of unmasking features is trained to label each chunk as
    being from document di or dj. The sampling is repeated five times, and the averaged
    leave-one-out cross-validation accuracy is reported to represent the discrimination
    performance.
 2. Remove the top 3 most discriminating features of the weak classifier, and repeat
    Step 1 using the remaining features.

Pair ⟨di, dj⟩ is thus unmasked by the degradation of the cross-validation accuracy
after each iteration of feature removal. This degradation of accuracies is encoded as
a numeric vector using the original representation in [5]. Finally, a binary classifier,
called a meta-classifier, is trained to differentiate same degradation curves (Ai = Aj) from
different degradation curves (Ai ≠ Aj).
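    To make the procedure concrete, the following sketch (in Python with scikit-learn, rather
than the Weka/LibSVM setup we actually use; function and variable names are purely
illustrative) shows one way a degradation curve could be computed for a single pair of
chunked, vectorized documents. The balanced re-sampling of chunks and the averaging over
five samples described above are omitted for brevity.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def degradation_curve(chunks_i, chunks_j, n_iterations=10, n_remove=3):
    # chunks_i, chunks_j: feature matrices, one row per chunk of d_i and d_j
    X = np.vstack([chunks_i, chunks_j])
    y = np.array([0] * len(chunks_i) + [1] * len(chunks_j))
    active = np.arange(X.shape[1])          # indices of features still in play
    curve = []
    for _ in range(n_iterations):
        clf = LinearSVC()
        # leave-one-out cross-validation accuracy of the weak classifier
        acc = cross_val_score(clf, X[:, active], y, cv=LeaveOneOut()).mean()
        curve.append(acc)
        # refit on all chunks and drop the n_remove most discriminating
        # features, i.e. those with the largest absolute weights
        clf.fit(X[:, active], y)
        top = np.argsort(np.abs(clf.coef_[0]))[-n_remove:]
        active = np.delete(active, top)
    return curve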


2.2   Enhanced features for unmasking

Our main extension to Koppel et al.’s unmasking approach is the enrichment of the
features used in building the weak classifiers that produce the degradation curves
(Step 1 in Section 2.1). Although Koppel et al.’s word-frequency feature set achieved
competitive performance for verifying novel-length texts, we found that word-frequency
features are too unreliable for much shorter texts. Therefore, we use a more comprehensive
set of rich linguistic features, which can be partitioned into
two categories:


Coherence features As we have shown in earlier work [2], coherence features, based
on the local entity transition patterns derived from Barzilay and Lapata’s entity grids
[1], can be useful discourse-level authorship features. The entity grid model is based
on the assumption that a text naturally makes repeated reference to the elements of a
set of entities that are central to its topic. It represents local coherence as a sequence of
transitions, from one sentence to the next, in the grammatical role of these references.
For example, an entity may be mentioned in the subject of one sentence and then in
the object of the next — or not at all in the next. These coherence features are encoded
as a vector consisting of the relative proportions of a set of predefined entity transition
patterns. As in our earlier work [2], we use Reconcile-1.0 (http://www.cs.utah.edu/nlp/reconcile/)
to extract entities in texts and resolve coreferences.
    Since we are not aware of any available coreference resolution systems for lan-
guages other than English, it is nontrivial to extract coherence features for Spanish and
Greek texts. However, we believe that, while surface-form authorship features, such
as word usage, are generally obfuscated in the process of being translated to English,
coherence features, as a kind of discourse-level feature, are relatively well preserved.
Therefore, we first use the Google Translate service (http://translate.google.com/) to obtain
their English translations, and then perform entity extraction on these translations.
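    As an illustration only (not our implementation), the following Python sketch shows
how an entity grid, once produced by a coreference tool such as Reconcile, could be
converted into the coherence feature vector of relative transition proportions; the grid
and entity names here are hypothetical.

from collections import Counter
from itertools import product

ROLES = ['S', 'O', 'X', '-']                  # subject, object, other, absent
TRANSITIONS = [a + b for a, b in product(ROLES, repeat=2)]   # 16 length-2 patterns

def transition_proportions(grid):
    # grid: entity -> list of grammatical roles, one per sentence
    counts = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            counts[a + b] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in TRANSITIONS]

# Hypothetical example: "judge" is subject of sentence 1, object of sentence 2,
# and absent from sentence 3.
grid = {'judge': ['S', 'O', '-'], 'trial': ['-', 'X', 'S']}
coherence_features = transition_proportions(grid)   # 16 relative proportions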


Stylometric features In addition, we use a set of well-established stylometric features,
the majority of which are from our earlier work [4], including (1) Basic features: the
average sentence length (in words), the average word length (in characters), lexical
density, word length distribution; (2) Lexical features: frequencies of function words,
hapax legomena, and hapax dislegomena; (3) Character features: frequencies of various
characters; (4) Syntactic features: part-of-speech entropy, frequencies of part-of-speech
bigrams, and frequencies of syntactic production rules. English texts are parsed with the
Stanford CoreNLP toolkit (http://nlp.stanford.edu/software/corenlp.shtml), and Spanish texts
with the FreeLing toolkit (http://nlp.lsi.upc.edu/freeling/). We use the AUEB Greek
part-of-speech tagger (http://nlp.cs.aueb.gr/software.html) to obtain part-of-speech tags for
Greek texts (full syntactic parsing is not available for Greek).
    The total number of features is 538 for English, 568 for Greek, and 399 for Spanish
texts.
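    A minimal sketch of a few of the basic and lexical features listed above (average
sentence and word length, hapax legomena and dislegomena rates) is given below; it uses
naive whitespace and period splitting rather than the toolkits named above, so it is
illustrative only.

from collections import Counter

def basic_stylometric_features(text):
    sentences = [s for s in text.split('.') if s.strip()]   # naive segmentation
    words = text.split()
    counts = Counter(w.lower() for w in words)
    return {
        'avg_sentence_length': len(words) / max(len(sentences), 1),
        'avg_word_length': sum(len(w) for w in words) / max(len(words), 1),
        # proportions of vocabulary items occurring exactly once / exactly twice
        'hapax_legomena_rate': sum(1 for c in counts.values() if c == 1) / max(len(counts), 1),
        'hapax_dislegomena_rate': sum(1 for c in counts.values() if c == 2) / max(len(counts), 1),
    }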


2.3   Parameter configurations

There are a few parameters that can be adjusted in our approach. We tested several pa-
rameter combinations, and decided to use the following configurations which achieved
the best performance on the training data.

Chunk sizes: English and Spanish texts are segmented into 200-word chunks, while Greek texts
are segmented into 100-word chunks. In both cases, any leftover words in a document are discarded.
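    The chunking rule can be illustrated as follows (a sketch only; variable names are
hypothetical):

def chunk_words(text, chunk_size):
    # fixed-size, non-overlapping word chunks; the leftover tail is discarded
    words = text.split()
    n_full = len(words) // chunk_size
    return [' '.join(words[i * chunk_size:(i + 1) * chunk_size])
            for i in range(n_full)]

# e.g. chunk_words(english_or_spanish_document, 200), chunk_words(greek_document, 100)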

Unmasking iterations: The unmasking procedure of Section 2.1 is repeated N times in order
to obtain the degradation curve. For English texts, N = 20, while for both Greek and
Spanish texts, N = 10.

Classifiers: For the weak classifier used to produce the degradation curves, we use the
linear-kernel LibSVM classifier implemented in Weka 3.7.7 [3]; the top 3 most discriminating
features removed in each unmasking iteration are the 3 features with the highest
absolute-value weights in the linear kernel. For the meta-classifier that differentiates
same degradation curves from different degradation curves, we use the Bagging classifier
provided by the Weka package. All classifiers use their default parameter settings.
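    The meta-classification step could be sketched as follows (again in Python with
scikit-learn as a stand-in for the Weka Bagging classifier we actually use; the curve lists
are assumed to have been produced by the unmasking procedure of Section 2.1):

from sklearn.ensemble import BaggingClassifier

def train_meta_classifier(same_curves, diff_curves):
    # each curve is a fixed-length list of cross-validation accuracies
    X = same_curves + diff_curves
    y = [1] * len(same_curves) + [0] * len(diff_curves)   # 1 = same author
    meta = BaggingClassifier()                             # default parameters
    meta.fit(X, y)
    return meta

# At test time, an unknown/known document pair is unmasked to a curve and then
# classified: meta.predict([curve]) returns 1 for "same author", 0 otherwise.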


3   Experiments and Results

We use only the training corpus released for the PAN 2013 Authorship Identification task
(10 English cases, 20 Greek cases, and 5 Spanish cases) and no other supplementary
materials. A separate model is built for each language.
    In training, for a particular language, we use all known documents written in that
language in the training corpus to produce same and different degradation curves. Since
there are typically many more different degradation curves than same ones, we sample
at most 500 same curves and 1000 different curves.
    For evaluation, we unmask each unknown document in the corpus, against the given
known documents of the same case, to get a degradation curve corresponding to this
unmasking. We then use the trained meta-classifier to classify the resulting degradation
curve as same or different, and thus determine the final answer.
    We produce a yes/no answer for every case and obtain an overall accuracy of 65.7%
over the 35 cases in the training corpus: 100.0% for both English and Spanish, but only
40% for Greek. On the test corpus, the final evaluation of our system yields an overall
accuracy of 68.2%, with roughly the same performance across the three languages.
    The unmasking features in our authorship verification system, namely the well-established
stylometric features and our novel coherence features, were designed primarily for English
texts. We expect that a better understanding of a particular language, especially one with
fewer available NLP tools, would allow more effective unmasking features to be designed and
competitive performance to be achieved on non-English languages as well.


References
1. Barzilay, R., Lapata, M.: Modeling local coherence: an entity-based approach. In:
   Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. pp.
   141–148. ACL 2005, Association for Computational Linguistics, Stroudsburg, PA, USA
   (2005)
2. Feng, V.W., Hirst, G.: Patterns of local discourse coherence as a feature for authorship
   attribution. Literary and Linguistic Computing (2013)
3. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
   data mining software: An update. SIGKDD Explorations 11(1) (2009)
4. Hirst, G., Feng, V.W.: Changes in style in authors with Alzheimer’s disease. English Studies
   93(3), 357–370 (2012)
5. Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: Unmasking
   pseudonymous authors. The Journal of Machine Learning Research 8, 1261–1276 (2007)