Adapting for Subject-Specific Term Length using Topic Cost in Author Verification
Notebook for PAN at CLEF 2015

Anna Vartapetiance and Lee Gillam
Department of Computing, University of Surrey, UK
{a.vartapetiance, l.gillam}@surrey.ac.uk

Abstract. Previous PAN workshops have offered us the opportunity to explore three different approaches to author verification using basic statistics of stopword pairs. For this PAN, we selected our 'best' approach and explored how authors writing about different subjects necessarily adapt to term lengths specific to those subjects. The adaptation required of our approach is, essentially, a redistribution of positional frequency where longer terms occur. We introduce the notion of a 'topic cost' which increases the propensity for matching. Results show AUC and C1 scores of 0.51, 0.46 and 0.59 for Dutch, Greek and Spanish respectively. The English results are not yet available, as the evaluation system was unable to run the approach for reasons as yet unknown.

1 Introduction

At the 6th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2012), we first tested our ideas on co-occurrence patterns of stopwords [1]. At the 8th iteration (PAN 2014), we presented three variations of our approach, largely geared around evaluating the use of similarity/distance over vector spaces [2]. In this paper, we suggest an extension to our PAN 2014 approaches that accounts for a 'topic cost'. Put simply, there are several reasons why the separation of a specific stopword pair may be a weaker indicator of similarity, and accounting for term length and term count offers a way to address this.

In Section 2, we briefly discuss the approaches we have previously used for author verification. Section 3 explains how we determine and use topic cost. Section 4 offers results and evaluation, and Section 5 concludes the paper.

2 Previous methods applied

As discussed in [1], for PAN 2012 we approached author 'attribution' using a mean-variance framework over patterns of stopwords: positional frequencies were identified for pairs of the 10 most common English stopwords within a specified maximum window size, and an author was allocated based on the nearest frequency-mean-variance match. For PAN 2013, the core approach remained the same, with output adapted to the Boolean answers required. That task introduced Greek and Spanish texts, of which the authors have no real knowledge, so lists of 10 frequent stopwords were sought for each language. For PAN 2014, we reused these stoplists along with a stoplist for Dutch, yet another language of which the authors have no real knowledge. We also evaluated three approaches:

Frequency-Mean-Variance (FMV): We follow the approach detailed at length in Vartapetiance and Gillam (2013), generating frequency information for stopword pairs, determining mean and variance for separation, then applying cosine distance to compare the resulting feature vectors.

Positioning: This approach is based on FMV, above, but omits step 4 (the mean and variance calculation) and so acts as a cosine comparison on positional frequencies for each pattern. This would tend to require comparable frequencies for each feature to ensure a good match.

Cosine: We modify the Positioning approach to consider the frequency information for all patterns as a single vector, then apply cosine distance between the resulting vectors.
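To make the Cosine variant concrete, the following is a minimal Python sketch under stated assumptions; it is not the code submitted to PAN. The five-word stoplist, the maximum gap of 10, and the pairing of consecutive stopword occurrences are illustrative choices standing in for the per-language resources and settings described above.

```python
from collections import Counter
from math import sqrt

# Illustrative five-word stoplist and window size; the per-language lists of
# ten frequent stopwords used in our submissions are not reproduced here.
STOPWORDS = {"the", "of", "in", "for", "to"}
MAX_GAP = 10  # assumed maximum window size between paired stopwords

def pair_positional_frequencies(tokens, stopwords=STOPWORDS, max_gap=MAX_GAP):
    """Count how often each consecutive stopword pair occurs at each gap,
    the gap being the number of tokens strictly between the two stopwords."""
    occurrences = [(i, t.lower()) for i, t in enumerate(tokens)
                   if t.lower() in stopwords]
    counts = Counter()
    for (i, w1), (j, w2) in zip(occurrences, occurrences[1:]):
        gap = j - i - 1
        if gap <= max_gap:
            counts[(w1, w2, gap)] += 1
    return counts

def cosine_distance(u, v):
    """Cosine distance between two sparse vectors held as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# A verification decision would then compare this distance against a
# threshold tuned on the training corpus.
known = pair_positional_frequencies(
    "the Bank of England kept rates unchanged for six years to date".split())
unknown = pair_positional_frequencies(
    "the Governor of the Bank argued for lower rates to come".split())
print(cosine_distance(known, unknown))
```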
Here we also consider how to determine a match: as a single cosine distance between one known and one unknown text; as a difference in distance within a threshold when two known texts can be compared; and, when the unknown can be compared against many known texts, as whether its distances sit at a suitable point on the distribution of distances amongst the knowns. Acceptability according to thresholds, and cosine distance, can then be used together to determine match confidence.

3 PAN 2015

For this year's task, we wanted to explore the ability to match where the same author may necessarily vary their writing according to the topic. This would account for, say, simple temporal modification (discussing, for example, 'the former Prime Minister of' rather than 'the Prime Minister of'), but is principally geared to account for differences in term length as they relate to topics. In the 'Prime Minister' example given, the same stopword pair the-of is present, but with a positional mismatch. Since position, and variability in position, is core to our approaches, we require a simple way to address the pattern-specific positional misalignment that occurs.

To approach this, we introduce the notion of a 'topic cost' and distribute positional frequencies according to this topic cost. To determine topic cost, we simply count the number of terms and the number of words these terms comprise, and use the difference between these values for redistribution. The only additional resource employed is a language-specific stoplist, which exposes the terms.

As an example, consider the following passage of text:

    UK interest rates have been kept unchanged again by the Bank of England, meaning they have now been at their record low of 0.5% for six years. Rates were first cut to 0.5% in March 2009 as the Bank sought to lift economic growth amid the credit crunch.

Take stopword pairs as formed from [the, of, in, for, to]. If we ignore the sentence break, the first pair of interest offers us: "for six years. Rates were first cut to". The distance covered by the pair is 6 (the number of words between "for" and "to"). Collecting all terms, using all stopwords (not just those listed) as delimiters (and, here, the full stop also), results in 3 terms comprising 5 words: six years, rates, first cut. The topic cost, the difference between the word count and the term count, is then 2. Instead of counting once at position 6, we distribute uniformly (other weightings are possible but unexplored) across position 6 and the two preceding positions, so that positions 4, 5 and 6 each receive 0.333. This example, and further examples from the above passage, are shown in Table 1.

Table 1: Example of 'topic cost' applied to the sample passage

Extracted text                         | Gap | Remove all stops            | Topic cost | Shift (pair, gap, count)
for six years. Rates were first cut to |  6  | six years; rates; first cut |     2      | for-to, 6, 1 becomes for-to, 6, 0.333; for-to, 5, 0.333; for-to, 4, 0.333
to 0.5% in                             |  1  | 0.5%                        |     0      | no change
to lift economic growth amid the       |  4  | lift economic growth amid   |     3      | to-the, 4, 1 becomes to-the, 4, 0.25; to-the, 3, 0.25; to-the, 2, 0.25; to-the, 1, 0.25
in March 2009 as the                   |  3  | March 2009                  |     1      | in-the, 3, 1 becomes in-the, 3, 0.5; in-the, 2, 0.5
the Bank of                            |  1  | Bank                        |     0      | no change
the Bank sought to                     |  2  | Bank sought                 |     1      | the-to, 2, 1 becomes the-to, 2, 0.5; the-to, 1, 0.5

In principle, use of topic cost offers greater potential for matches using our previous approaches. In practice, the extent of improvement over previous results is likely to be marginal.
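As a sketch of how this redistribution might be computed, the following Python reproduces the for-to example from Table 1. The function names (topic_cost, redistribute), the delimiter stoplist and the uniform weighting are assumptions for exposition rather than the code we submitted.

```python
# Pair-forming stopwords (as in the example above) versus the fuller
# language-specific stoplist used only to delimit terms; both lists are
# illustrative assumptions, as is the treatment of punctuation as a delimiter.
PAIR_STOPWORDS = {"the", "of", "in", "for", "to"}
TERM_DELIMITERS = PAIR_STOPWORDS | {"were", "as", "by", "they", "have", "."}

def topic_cost(span_tokens, delimiters=TERM_DELIMITERS):
    """Topic cost of the words lying strictly between a stopword pair:
    the number of words making up the terms minus the number of terms,
    where terms are the token runs left once all delimiters are removed."""
    terms, current = [], []
    for tok in span_tokens:
        if tok.lower() in delimiters:
            if current:
                terms.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        terms.append(current)
    return sum(len(t) for t in terms) - len(terms)

def redistribute(pair, gap, cost):
    """Spread a single observation of `pair` at `gap` uniformly across the
    observed gap and the `cost` preceding gaps (0.333 each in the example)."""
    if cost <= 0 or gap - cost < 1:  # defensive: no terms, or nothing to shift into
        return {(pair, gap): 1.0}
    weight = 1.0 / (cost + 1)
    return {(pair, g): weight for g in range(gap - cost, gap + 1)}

# Reproducing the first row of Table 1: gap 6, topic cost 2, so the single
# count is spread as 0.333 over gaps 4, 5 and 6 for the for-to pair.
span = "six years . Rates were first cut".split()
print(topic_cost(span))                                  # -> 2
print(redistribute(("for", "to"), 6, topic_cost(span)))  # -> thirds at gaps 4, 5, 6
```

The redistributed counts can then be fed into the same cosine comparisons as before, since only the positional frequency vectors change.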
4 Results

Results for each of the PAN 2015 collections are shown in Table 2, based on the four language categories.

Table 2: Results from our approach on the test corpus

Collection | AUC  | C1   | Score
Dutch      | 0.51 | 0.51 | 0.262
English    | ---  | ---  | ---
Greek      | 0.46 | 0.46 | 0.212
Spanish    | 0.59 | 0.59 | 0.348

Due to an as yet unknown problem with the English run, the evaluation system was unable to calculate outcomes for that part of the test. Also, unfortunately, the results of runs using last year's systems will not be available until after this paper is submitted, so the authors are unable to provide a comparison between systems to see whether or not this approach improves detection. However, the results of runs on the training datasets using the FMV, Positioning and Topic Cost systems (Table 3) show some improvement in detection using the new system.

Table 3: AUC for the FMV, Positioning and Topic Cost systems on the training corpus

Collection | FMV  | Positioning | Topic Cost
Dutch      | 0.50 | 0.49        | 0.46
English    | 0.46 | 0.51        | 0.53
Greek      | 0.45 | 0.51        | 0.56
Spanish    | 0.54 | 0.55        | 0.56
Average    | 0.49 | 0.52        | 0.53

5 Conclusions and Future Work

In this paper, we suggested an extension to our PAN 2014 approaches to authorship verification that accounts for a 'topic cost'. For us, topic cost may account for lower match values in our previous approaches, and our intention was to determine whether a simple treatment of topic cost could improve our results. This modification requires much more testing with respect to the test collections of previous years to fully appreciate its effect. Unfortunately, other activities hindered the authors' ability to allocate sufficient time to such testing during this round of PAN.

Acknowledgements

The authors gratefully acknowledge prior funding from the UK's Technology Strategy Board (TSB, 169201), and also the efforts of the PAN organizers in crafting and managing the tasks.

References

[1] A. Vartapetiance and L. Gillam, "Quite Simple Approaches for Authorship Attribution, Intrinsic Plagiarism Detection and Sexual Predator Identification - Notebook for PAN at CLEF 2012," in Working Notes Papers of the CLEF 2012 Evaluation Labs, 2012.
[2] A. Vartapetiance and L. Gillam, "A Trinity of Trials: Surrey's 2014 Attempts at Author Verification - Notebook for PAN at CLEF 2014," in Working Notes Papers of the CLEF 2014 Evaluation Labs, 2014.