Higher Criticism as an Unsupervised
                      Authorship Discriminator
                         Notebook for PAN at CLEF 2020

                                          Alon Kipnis

                                     Department of Statistics
                                      Stanford University
                                     kipnisal@stanford.edu


        Abstract We adapt the Higher Criticism (HC) statistic as an unsupervised, untrained
        discriminator of two documents. Our method computes word-by-word p-values based on
        a binomial allocation model of words between the documents and combines these
        p-values into a single test statistic using HC. Large values of HC provide evidence
        that the two documents differ in authorship. Despite its simplicity, the method
        achieves competitive results in the Cross-domain Authorship Verification challenge.


1     Overview
Consider two word-frequency tables representing word occurrences across two docu-
ments. We would like to check whether the two tables can be regarded as two samples
from the same unspecified distribution, or not. Of course, ‘word’ here could actually
assume an extended meaning, such as ‘n-grams’, n-tuples of consecutive ‘dictionary
words’, or indeed other features of the text that we can render into counts-of-occurrence
tables.
    In our recent work [4], we proposed a test for this problem based on the Higher
Criticism (HC) statistic [1]. The test has two steps. In the first step, we perform many
exact binomial tests, one for each word in a prescribed dictionary. The result of
each test is a P-value under a binomial allocation model of the words between
the two documents. This model states that each occurrence of a word is equally likely
to appear in either document, after accounting for the difference in the sizes of the
documents. The second step combines the P-values from the first step into a single
score using the HC statistic.
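    The two steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the challenge: the binomial success probability is taken to be the first document's share of all counts, the two-sided P-value sums the probabilities of outcomes no likelier than the observed one, and HC is computed in one common form, the maximum over i up to gamma0 * N of sqrt(N) (i/N - p_(i)) / sqrt((i/N)(1 - i/N)). The function names and the default gamma0 = 0.2 are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes, via log-gamma."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def exact_binom_pvalue(k, n, p):
    """Two-sided exact P-value: total mass of outcomes no likelier than k."""
    observed = binom_pmf(k, n, p)
    mass = sum(q for q in (binom_pmf(j, n, p) for j in range(n + 1))
               if q <= observed * (1.0 + 1e-7))  # small tolerance for ties
    return min(1.0, mass)

def word_pvalues(table1, table2):
    """Step 1: one exact binomial test per word (assumed allocation model)."""
    total1, total2 = sum(table1.values()), sum(table2.values())
    if total1 + total2 == 0:
        return {}
    share1 = total1 / (total1 + total2)  # assumption: document-size share
    pvalues = {}
    for word in set(table1) | set(table2):
        n1, n2 = table1.get(word, 0), table2.get(word, 0)
        if n1 + n2 > 0:
            pvalues[word] = exact_binom_pvalue(n1, n1 + n2, share1)
    return pvalues

def higher_criticism(pvalues, gamma0=0.2):
    """Step 2: combine P-values into one score (one common HC variant)."""
    ranked = sorted(pvalues)
    n = len(ranked)
    if n < 2:
        return 0.0
    limit = min(n - 1, max(1, int(gamma0 * n)))  # search the smallest P-values
    return max(
        math.sqrt(n) * (i / n - ranked[i - 1]) / math.sqrt((i / n) * (1 - i / n))
        for i in range(1, limit + 1)
    )

def hc_score(table1, table2, gamma0=0.2):
    """HC score of two word-frequency tables given as dicts word -> count."""
    return higher_criticism(word_pvalues(table1, table2).values(), gamma0)
```

    Calling hc_score on two dictionaries that map words to counts returns a single number; under this sketch, larger values indicate a larger discrepancy between the tables.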
    In a recent study [2], we show that the method has certain optimality properties
under the binomial word allocation model in a ‘rare/weak departures’ setting.
In this setting, if the two distributions are different, they differ only in relatively
few words and only by relatively subtle amounts.
    In practice, it is unreasonable to assume that the underlying binomial word allocation
model is correct, as there may be departures from this model due to topic structure
and other violations of binomial word allocation. Nevertheless, the analysis of [4] shows
that our method performs well in several authorship challenges even in the presence of
such violations. The performance of our method in the PAN 2020 Authorship Verification
Challenge [3] shows that it serves as an effective authorship discriminator that
requires very little tuning.
    The section below provides details on the implementation of the method in the
aforementioned authorship verification task.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
    License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020,
    Thessaloniki, Greece.


2     Detailed Description

2.1   Vocabulary

We use a vocabulary consisting of 350 of the most common words, bi-grams, and tri-
grams in the small calibration set. Each document is reduced to its associated word-
frequency table over this vocabulary.
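    As an illustration of this reduction, the sketch below builds a vocabulary of the most frequent n-grams (n = 1, 2, 3) and maps a document to its frequency table. The tokenizer, the pooled ranking of words, bi-grams, and tri-grams in a single list, and all function names are assumptions made for the example; the text does not specify these details.

```python
import re
from collections import Counter

def ngrams(text, max_n=3):
    """Yield all word n-grams of the text for n = 1, ..., max_n."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_vocabulary(documents, size=350, max_n=3):
    """Return the `size` most frequent n-grams pooled over the documents."""
    counts = Counter()
    for text in documents:
        counts.update(ngrams(text, max_n))
    return [term for term, _ in counts.most_common(size)]

def frequency_table(text, vocabulary, max_n=3):
    """Reduce one document to its counts over the fixed vocabulary."""
    counts = Counter(ngrams(text, max_n))
    return {term: counts[term] for term in vocabulary}
```

    Terms of the vocabulary that do not occur in a document receive a count of zero, so every document is represented over the same set of terms.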
    Out of these 350 vocabulary items, HC automatically selects a much smaller list
tailored to each document pair to which it is applied; the evidence for effective
discrimination is thought to lie within that selected list [4]. Consequently, the
accuracy of the method appears to be insensitive to vocabulary sizes larger than 350.
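    One way to read this selection step is as HC thresholding: the words whose P-values do not exceed the P-value at which the HC objective attains its maximum. The sketch below follows that reading. The form of the objective, the search range parameter gamma0, and the function name are assumptions for illustration.

```python
import math

def hc_selected_words(pvalues, gamma0=0.2):
    """Return the words with P-values at or below the HC threshold.

    `pvalues` maps each word to its P-value. The HC objective used here is
    sqrt(N) * (i/N - p_(i)) / sqrt((i/N) * (1 - i/N)), maximized over the
    i smallest P-values with i <= gamma0 * N (an assumed variant).
    """
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    n = len(ranked)
    if n < 2:
        return []
    limit = min(n - 1, max(1, int(gamma0 * n)))
    best_i, best_score = 1, -math.inf
    for i in range(1, limit + 1):
        frac = i / n
        score = math.sqrt(n) * (frac - ranked[i - 1][1]) / math.sqrt(frac * (1 - frac))
        if score > best_score:
            best_i, best_score = i, score
    threshold = ranked[best_i - 1][1]  # P-value attaining the maximum
    return [word for word, p in ranked if p <= threshold]
```

    Because the threshold depends on the P-values of the pair at hand, the selected list changes from one document pair to the next.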


2.2   Calibration

Our method only requires calibration of the HC score to produce the probability of the
event ‘same author’. This calibration is done by evaluating the empirical distribution of
HC associated with document-pairs over the provided calibration set. We consider
the empirical distribution under the cases of ‘same author’ ($H_0$) and ‘different author’
($H_1$) separately. For simplicity, we fit a normal distribution to each of these empirical
distributions and only store the parameters $(\mu_i, \sigma_i^2)$, $i = 0, 1$, of the fitted distributions.
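    A minimal sketch of this calibration step, assuming the HC scores of the calibration pairs have already been computed and split by label. The function name and the returned dictionary layout are hypothetical; the fit uses the sample mean and the sample variance, which is one of several reasonable estimators.

```python
from statistics import NormalDist

def fit_calibration(hc_same, hc_diff):
    """Fit one normal distribution per hypothesis and keep (mean, variance).

    hc_same: HC scores of calibration pairs labeled 'same author' (H0).
    hc_diff: HC scores of calibration pairs labeled 'different author' (H1).
    Each list needs at least two scores for the variance to be defined.
    """
    same = NormalDist.from_samples(hc_same)
    diff = NormalDist.from_samples(hc_diff)
    return {"H0": (same.mean, same.variance), "H1": (diff.mean, diff.variance)}
```

    Only these four numbers need to be stored; the calibration pairs themselves are not required at test time.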


2.3   Testing

Given a test case $(D_1, D_2)$, we first evaluate the HC score $\mathrm{HC}(D_1, D_2)$ of its two
frequency tables. We report $p(D_1, D_2) = \Pr\{\mathrm{HC}(D_1, D_2) \mid H_0\}$ under the assumption
that

$$
\mathrm{HC}(D_1, D_2) \sim
\begin{cases}
\mathcal{N}(\mu_0, \sigma_0^2), & H_0, \\
\mathcal{N}(\mu_1, \sigma_1^2), & H_1,
\end{cases}
$$

where we assume that a priori, the cases $H_0$ and $H_1$ are equally likely. If $p(D_1, D_2)$
happens to fall in the interval $(0.45, 0.55)$, we report 0.5 instead.
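    The reporting rule can be sketched as follows. The sketch assumes one particular reading of the reported quantity: the probability of ‘same author’ given the observed HC score, obtained from Bayes' rule with equal prior weights on the two fitted normal densities. This reading is consistent with the equal-prior assumption and with the stated goal of the calibration, but it is an interpretation rather than a confirmed detail. The function name, argument layout, and the fallback for vanishing densities are hypothetical.

```python
from statistics import NormalDist

def same_author_score(hc, params0, params1, band=(0.45, 0.55)):
    """Score a test pair from its HC value and the fitted (mean, variance) pairs.

    params0 and params1 hold (mean, variance) under H0 and H1. With equal
    priors, the posterior of H0 is f0 / (f0 + f1), where f0 and f1 are the
    fitted normal densities evaluated at the observed HC score.
    """
    mu0, var0 = params0
    mu1, var1 = params1
    f0 = NormalDist(mu0, var0 ** 0.5).pdf(hc)
    f1 = NormalDist(mu1, var1 ** 0.5).pdf(hc)
    if f0 + f1 == 0.0:
        return 0.5  # assumed fallback when both densities underflow
    p = f0 / (f0 + f1)
    low, high = band
    return 0.5 if low < p < high else p  # undecided band is reported as 0.5
```

    For example, with fitted parameters (1.0, 1.0) under H0 and (3.0, 1.0) under H1, an HC score of 2.0 lies midway between the two means and is reported as 0.5, while a score of 1.0 yields a value well above 0.55.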


References

1. Donoho, D., Jin, J.: Higher criticism for detecting sparse heterogeneous mixtures. The
   Annals of Statistics 32(3), 962–994 (2004)
2. Donoho, D., Kipnis, A.: Two-sample testing for large, sparse high-dimensional multinomials
   under rare/weak perturbations (2020), https://arxiv.org/abs/2007.01958
3. Kestemont, M., Manjavacas, E., Markov, I., Bevendorff, J., Wiegmann, M., Stamatatos, E.,
   Potthast, M., Stein, B.: Overview of the Cross-Domain Authorship Verification Task at PAN
   2020. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and
   Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
4. Kipnis, A.: Higher criticism for discriminating word-frequency tables and testing authorship
   (2019), https://arxiv.org/abs/1911.01208