<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ordinal measures in authorship identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liviu P. Dinu</string-name>
          <email>ldinu@funinf.cs.unibuc.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marius Popescu</string-name>
          <email>mpopescu@phobos.cs.unibuc.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bucharest, Faculty of</institution>
          ,
          <addr-line>Mathematics and Computer Science, 14 Academiei, Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>62</fpage>
      <lpage>66</lpage>
      <abstract>
        <p>The goal of this paper is to compare a set of distance/similarity measures with regard to their ability to reflect stylistic similarity between authors and texts. To assess the ability of these distance/similarity functions to capture stylistic similarity between texts, we tested them in one of the most frequently employed multivariate statistical analysis settings: cluster analysis. The experiments are carried out on a corpus of 30 English books written by British, American and Australian writers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The authorship identification problem is an
ancient and pervasive challenge: almost
every culture has its
disputed works (Shakespeare’s plays, Molière vs.
Corneille (Labbe and Labbe, 2001), the
Federalist Papers
        <xref ref-type="bibr" rid="ref1 ref5 ref7">(Mosteller and Wallace, 2007)</xref>
        ,
etc.). Authorship
identification rests on the assumption that
there are stylistic features that help
distinguish the real author from any other
candidate. Traditional literary-linguistic research is limited
by the human capacity to analyze and
combine only a small number of text parameters.
Computational methods overcome this
limitation by allowing us to explore
many text parameters and characteristics,
and their combinations. Using such
methods,
        <xref ref-type="bibr" rid="ref4">van Halteren et al. (2005)</xref>
        have shown
that every writer has a unique fingerprint
regarding language use. The set of language-use
characteristics (stylistic, lexical,
syntactic) forms the human stylome.
      </p>
      <p>
        Because every computational stylistic
study involves, in one way or another,
a comparison of two or more texts,
there has always been a need for a
distance/similarity function to measure the
similarity (or dissimilarity) of texts from the
stylistic point of view. These measures vary widely,
and in recent years a range of different
techniques has been used in authorship identification:
approaches based on string kernels
        <xref ref-type="bibr" rid="ref2 ref8">(Dinu et
al., 2008)</xref>
        , SVMs based on function word
frequencies
        <xref ref-type="bibr" rid="ref5">(Koppel et al., 2007)</xref>
        , and standard
distances or ordinal distances
        <xref ref-type="bibr" rid="ref10 ref2 ref8">(Popescu and
Dinu, 2008)</xref>
        .
      </p>
      <p>The goal of this paper is to compare a
set of distance/similarity measures
with regard to their ability to reflect stylistic similarity
between texts.</p>
      <p>
        As style markers we used
function word frequencies. Function words are
generally considered good indicators of style
because their use is very unlikely to be
under the conscious control of the author and
because of their psychological and cognitive
role
        <xref ref-type="bibr" rid="ref1 ref5 ref7">(Chung and Pennebaker, 2007)</xref>
        . Function words
have also proved very effective in
many author attribution studies.
      </p>
      <p>The distance/similarity between two texts
is measured as the distance/similarity
between the function word frequencies
of the respective texts. For this
study we selected several similarity/distance
measures. We started with the most natural
ones: Euclidean
distance and (taking into account the statistical
nature of the data) Pearson’s correlation
coefficient. Since function word frequencies can
also be viewed as ordinal variables, we also
considered for comparison some measures
specific to ordinal data: Spearman’s rank-order
coefficient, Spearman’s footrule, Goodman and
Kruskal’s gamma, and Kendall’s tau.</p>
      <p>To assess the ability of these
distance/similarity functions to capture stylistic
similarity between texts, we tested them
in one of the most frequently employed
multivariate statistical analysis settings: cluster
analysis. Clustering is a very good test bed
for the behavior of a distance/similarity measure.
We plugged the distance/similarity measures
selected for comparison into a standard
hierarchical clustering algorithm and applied it
to a collection of 30 nineteenth-century
English books. The family trees thus obtained
revealed a lot about the behavior of the
distance/similarity measures.</p>
      <p>The main finding of our comparison
is that the similarity measures that treat
function word frequencies as ordinal
variables performed better than the other
distance/similarity measures. Treating function
word frequencies as ordinal variables means
that the distance/similarity
function is computed from the ranks of the function words
according to their frequencies in the text
rather than from the actual values of these
frequencies. Using the ranks
instead of the
actual frequencies may seem
a loss of information, but we consider
that ranking makes the
distance/similarity measure more robust: it acts
as a filter, eliminating the noise contained in
the values of the frequencies. The fact that a
specific function word has rank 2 (is the
second most frequent word) in one text and
rank 4 (is the fourth most frequent
word) in another text can be more relevant
than the fact that the word
has a frequency of 34% in the first text and only
29% in the second.</p>
      <p>The next section presents the
distance/similarity measures involved in the
comparison, Section 3 briefly describes
the cluster analysis, and Sections 4 and 5
present the experiments, the results
obtained, and suggestions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Similarity Measures</title>
      <p>
        If we treat texts as random variables whose
values are the frequencies of different words
in the respective texts, then various
statistical correlation measures can be used as
similarity measures between those texts. For two
texts X and Y and a fixed set of words
{w1, w2, . . . , wn}, let x1 denote the
relative frequency of w1 in X and y1 the relative
frequency of w1 in Y, and so on up to xn, the
relative frequency of wn in X, and yn, the relative
frequency of wn in Y.
Pearson’s correlation coefficient is:
        <disp-formula><tex-math>r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)</tex-math></disp-formula>
where <inline-formula><tex-math>\bar{x}</tex-math></inline-formula> is the mean of X, <inline-formula><tex-math>\bar{y}</tex-math></inline-formula> the mean of
Y, and sx and sy are the standard deviations of
X and Y, respectively
        <xref ref-type="bibr" rid="ref10 ref2">(Upton and Cook, 2008)</xref>
        .
The correlation coefficient measures the
tendency of two variables to change in value
together (i.e., to either increase or decrease).
r is related to the Euclidean distance: the quantity
<inline-formula><tex-math>\sqrt{2(1 - r)}</tex-math></inline-formula> is, up to a constant factor that depends only on n, the Euclidean distance
between the standardized versions of X and Y.
      </p>
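      <p>The relation between Pearson’s coefficient and the Euclidean distance can be checked numerically. The following is a minimal sketch, not the authors’ implementation: the function names and the sample frequencies are hypothetical, and standardization is assumed to use the sample standard deviation (n − 1 in the denominator), in which case the constant factor is the square root of n − 1.</p>
      <preformatted>
```python
import math

def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def pearson_r(x, y):
    """Pearson correlation of two equal-length frequency vectors."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def standardized_distance(x, y):
    """Euclidean distance between the standardized versions of x and y."""
    zx, zy = standardize(x), standardize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zx, zy)))

# Hypothetical relative frequencies of four function words in two texts.
x = [0.34, 0.20, 0.10, 0.05]
y = [0.29, 0.30, 0.02, 0.08]
r = pearson_r(x, y)
# Under this standardization the two printed distances agree:
# distance = sqrt(n - 1) * sqrt(2 * (1 - r)).
print(r)
print(standardized_distance(x, y))
print(math.sqrt((len(x) - 1) * 2 * (1 - r)))
```
      </preformatted>
      <p>Because the factor depends only on n, ranking texts by this distance or by 1 − r gives the same order when the word set is fixed.</p>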
      <p>
        The random variables X and Y representing
texts can also be treated as ordinal data, in
which the data are ordered but cannot be assumed
to have equal distances between values. In this
case the values of X (and respectively Y) are
the ranks of the words {w1, w2, . . . , wn}
according to their frequencies in text X (respectively Y) rather
than the actual values of these
frequencies. The most common correlation statistic
for ordinal data is Spearman’s rank-order
coefficient
        <xref ref-type="bibr" rid="ref10 ref2">(Upton and Cook, 2008)</xref>
        :
      </p>
      <p>
        <disp-formula><tex-math>r_{sc} = 1 - \frac{6}{n(n^2 - 1)} \sum_{i=1}^{n} (x_i - y_i)^2</tex-math></disp-formula>
Note that this time xi and yi are ranks;
in fact, Spearman’s rank-order
coefficient is Pearson’s correlation coefficient
applied to ranks. Spearman’s footrule is
the l1-version of Spearman’s rank-order
coefficient:</p>
      <p>
        <disp-formula><tex-math>r_{sf} = 1 - \frac{3}{n^2 - 1} \sum_{i=1}^{n} |x_i - y_i|</tex-math></disp-formula>
      </p>
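      <p>The two Spearman statistics can be illustrated with a short sketch. The function names and the example frequencies are hypothetical, and the ranking helper assumes that no two words share the same frequency (ties would need average ranks, which this sketch does not handle).</p>
      <preformatted>
```python
def ranks(freqs):
    """Rank 1 = most frequent word. Assumes all frequencies are distinct."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    result = [0] * len(freqs)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_rho(rx, ry):
    """Spearman's rank-order coefficient from two rank vectors."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def spearman_footrule(rx, ry):
    """Spearman's footrule (l1 version) from two rank vectors."""
    n = len(rx)
    d1 = sum(abs(a - b) for a, b in zip(rx, ry))
    return 1 - 3 * d1 / (n ** 2 - 1)

# Hypothetical relative frequencies of four function words in two texts.
rx = ranks([0.34, 0.20, 0.10, 0.05])   # [1, 2, 3, 4]
ry = ranks([0.29, 0.30, 0.02, 0.08])   # [2, 1, 4, 3]
print(spearman_rho(rx, ry))
print(spearman_footrule(rx, ry))
```
      </preformatted>
      <p>Only the order of the frequencies enters the computation: changing 0.34 to 0.31 in the first text would leave both values unchanged.</p>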
      <p>Another set of correlation statistics for
ordinal data is based on the number of
concordant and discordant pairs between two
variables. The number of concordant pairs
between two variables X and Y is P = |{(i, j) :
1 ≤ i &lt; j ≤ n, (xi − xj)(yi − yj) &gt; 0}|.
Similarly, the number of discordant pairs is Q =
|{(i, j) : 1 ≤ i &lt; j ≤ n, (xi − xj)(yi − yj) &lt;
0}|.</p>
      <p>
        Goodman and Kruskal’s gamma
        <xref ref-type="bibr" rid="ref10 ref2">(Upton
and Cook, 2008)</xref>
        is defined as:
        <disp-formula><tex-math>\gamma = \frac{P - Q}{P + Q}</tex-math></disp-formula>
      </p>
      <p>
        Kendall developed several slightly
different types of ordinal correlation as
alternatives to gamma. Kendall’s tau-a
        <xref ref-type="bibr" rid="ref10 ref2">(Upton and
Cook, 2008)</xref>
        is the difference between the numbers of
concordant and discordant pairs, divided by
the total number of pairs
(n being the sample size):
        <disp-formula><tex-math>\tau_a = \frac{P - Q}{n(n - 1)/2}</tex-math></disp-formula>
      </p>
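      <p>
        Kendall’s tau-b
        <xref ref-type="bibr" rid="ref10 ref2">(Upton and Cook, 2008)</xref>
        is
a similar measure of association based on
concordant and discordant pairs, adjusted for
the number of ties in ranks. It is calculated
as (P − Q) divided by the geometric mean of
the number of pairs not tied on X and
the number of pairs not tied on Y:
        <disp-formula><tex-math>\tau_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}}</tex-math></disp-formula>
where X0 is the number of pairs tied only on X and Y0 the
number of pairs tied only on Y, so that P + Q + Y0 and P + Q + X0
count the pairs not tied on X and on Y, respectively.
      </p>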
      <p>The three correlation statistics above
are closely related: if n is fixed and X and Y
have no ties, then P, X0 and Y0 are
completely determined by n and Q (in particular,
P + Q = n(n − 1)/2).</p>
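      <p>The pair-based statistics can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, and X0 and Y0 are taken to be the numbers of pairs tied only on X and only on Y, so that P + Q + Y0 and P + Q + X0 count the pairs not tied on X and on Y.</p>
      <preformatted>
```python
from itertools import combinations
from math import sqrt

def pair_counts(x, y):
    """Return (P, Q, X0, Y0): concordant, discordant, tied only on x, tied only on y."""
    p = q = x0 = y0 = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        prod = dx * dy
        if prod > 0:
            p += 1              # concordant pair
        elif prod != 0:
            q += 1              # product is negative: discordant pair
        elif dx == 0 and dy != 0:
            x0 += 1             # tied on x only
        elif dy == 0 and dx != 0:
            y0 += 1             # tied on y only
    return p, q, x0, y0

def gamma(x, y):
    p, q, _, _ = pair_counts(x, y)
    return (p - q) / (p + q)

def tau_a(x, y):
    p, q, _, _ = pair_counts(x, y)
    n = len(x)
    return (p - q) / (n * (n - 1) / 2)

def tau_b(x, y):
    p, q, x0, y0 = pair_counts(x, y)
    return (p - q) / sqrt((p + q + x0) * (p + q + y0))

# Two hypothetical rank vectors without ties: the three statistics coincide.
rx = [1, 2, 3, 4]
ry = [2, 1, 4, 3]
print(pair_counts(rx, ry))
print(gamma(rx, ry), tau_a(rx, ry), tau_b(rx, ry))
```
      </preformatted>
      <p>The double loop over pairs is quadratic in n, which is acceptable for a function word list of a few hundred entries.</p>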
    </sec>
    <sec id="sec-3">
      <title>Clustering Analysis</title>
      <p>
        An agglomerative hierarchical clustering
algorithm
        <xref ref-type="bibr" rid="ref3">(Duda et al., 2001)</xref>
        arranges a set of
objects in a family tree (dendrogram)
according to their similarity, which in
turn is given by a distance function defined on
the set of objects. The algorithm initially
assigns each object to its own cluster and then
repeatedly merges pairs of clusters until the
whole tree is formed. At each step the pair of
nearest clusters is selected for merging.
Agglomerative hierarchical clustering
algorithms differ in the way they
measure the distance between clusters:
although a distance function between objects
exists, the distance between clusters
(sets of objects) remains to be defined. In our
experiments we used the complete linkage
distance between clusters, that is, the maximum of the
distances between all pairs of objects drawn
from the two clusters (one object from the
first cluster, the other from the second).
      </p>
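      <p>The procedure described above can be sketched in a few lines. This is a naive illustration, not the implementation used in the experiments: the function name and the toy distance matrix are hypothetical, and a similarity such as Kendall’s tau would first have to be turned into a distance (for instance 1 − τ, which is an assumption and is not specified in the text).</p>
      <preformatted>
```python
def complete_linkage(dist):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of object indices.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the largest distance between members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or best[0] > d:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Toy distance matrix for four objects (hypothetical values).
dist = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
for left, right, height in complete_linkage(dist):
    print(left, right, height)
```
      </preformatted>
      <p>The sequence of merges and their heights is exactly the information a dendrogram displays; the final merge height (8 here) is the largest distance between the two top-level branches.</p>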
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
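      <p>In Popescu and Dinu (2009) we
compared the set of distance/similarity
measures described here on a collection of 21
nineteenth-century English books written by
10 different authors and spanning a variety
of genres (the same set of books was used
by Koppel et al. (2007) in their
authorship verification experiments). Those
experiments showed that the similarity
measures that treat function word frequencies
as ordinal variables (Spearman’s rank-order
coefficient, Spearman’s footrule, Goodman
and Kruskal’s gamma, Kendall’s tau)
performed better than the distance/similarity
measures that use the actual values of
function word frequencies (Euclidean distance,
Pearson’s correlation coefficient).</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Group</th><th>Author</th><th>Book</th></tr>
          </thead>
          <tbody>
            <tr><td>American Novelists</td><td>Hawthorne</td><td>Dr. Grimshawe’s Secret</td></tr>
            <tr><td>American Novelists</td><td>Hawthorne</td><td>House of Seven Gables</td></tr>
            <tr><td>American Novelists</td><td>Melville</td><td>Redburn</td></tr>
            <tr><td>American Novelists</td><td>Melville</td><td>Moby Dick</td></tr>
            <tr><td>American Novelists</td><td>Cooper</td><td>The Last of the Mohicans</td></tr>
            <tr><td>American Novelists</td><td>Cooper</td><td>The Spy</td></tr>
            <tr><td>American Novelists</td><td>Cooper</td><td>Water Witch</td></tr>
            <tr><td>American Essayists</td><td>Thoreau</td><td>Walden</td></tr>
            <tr><td>American Essayists</td><td>Thoreau</td><td>A Week on Concord</td></tr>
            <tr><td>American Essayists</td><td>Emerson</td><td>Conduct Of Life</td></tr>
            <tr><td>American Essayists</td><td>Emerson</td><td>English Traits</td></tr>
            <tr><td>British Playwrights</td><td>Shaw</td><td>Pygmalion</td></tr>
            <tr><td>British Playwrights</td><td>Shaw</td><td>Misalliance</td></tr>
            <tr><td>British Playwrights</td><td>Shaw</td><td>Getting Married</td></tr>
            <tr><td>British Playwrights</td><td>Wilde</td><td>An Ideal Husband</td></tr>
            <tr><td>British Playwrights</td><td>Wilde</td><td>Woman of No Importance</td></tr>
            <tr><td>Bronte Sisters</td><td>Anne</td><td>Agnes Grey</td></tr>
            <tr><td>Bronte Sisters</td><td>Anne</td><td>Tenant Of Wildfell Hall</td></tr>
            <tr><td>Bronte Sisters</td><td>Charlotte</td><td>The Professor</td></tr>
            <tr><td>Bronte Sisters</td><td>Charlotte</td><td>Jane Eyre</td></tr>
            <tr><td>Bronte Sisters</td><td>Emily</td><td>Wuthering Heights</td></tr>
            <tr><td>Australian Novelists</td><td>B. Baynton</td><td>Bush Studies</td></tr>
            <tr><td>Australian Novelists</td><td>B. Baynton</td><td>Human Toll</td></tr>
            <tr><td>Australian Novelists</td><td>Henry Lawson</td><td>Joe Wilson and His Mates</td></tr>
            <tr><td>Australian Novelists</td><td>Henry Lawson</td><td>On the Track</td></tr>
            <tr><td>Australian Novelists</td><td>Henry Lawson</td><td>While the Billy Boils</td></tr>
            <tr><td>Australian Novelists</td><td>Miles Franklin</td><td>My Brilliant Career</td></tr>
            <tr><td>Australian Novelists</td><td>Miles Franklin</td><td>Some Everyday Folk and Dawn</td></tr>
            <tr><td>Australian Novelists</td><td>Miles Franklin</td><td>Up the Country: A Saga of...</td></tr>
            <tr><td>Australian Novelists</td><td>Miles Franklin</td><td>Back to Bool Bool</td></tr>
          </tbody>
        </table>
      </table-wrap>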
      <p>The aim of the present experiments was
twofold. First, we wanted to see whether the
findings in Popescu and Dinu (2009) are
confirmed on a larger set (more
authors, more books). Second, we wanted to further
investigate the ability of some of the
similarity measures (Spearman’s rank-order
coefficient, Goodman and Kruskal’s gamma,
Kendall’s tau) to distinguish between the
different nationalities of English-language writers.
To the original
data set of Koppel et al. (2007) we therefore added 9
works by three Australian authors from the
same period, resulting in a data set of 30 books
and 13 authors (Table 1).</p>
      <p>To perform the experiments, a set of words
must be fixed. The most frequent
function words may be selected, or other
selection criteria may be used. In all our
experiments we used the set of function words
identified by Mosteller and Wallace (2007) as
good candidates for author-attribution
studies. We used the agglomerative hierarchical
clustering algorithm coupled with each of the
distance/similarity functions under
comparison to cluster the works in Table 1.</p>
      <p>The dendrograms obtained support the
results of Popescu and Dinu (2009). The
dendrograms for Euclidean distance
and Pearson’s correlation coefficient (not
shown for lack of space) are very
similar, which is no surprise given
the close relation between the two measures
(see Section 2). The problem with these
family trees is that the works of Melville are not
grouped together: one (Moby Dick) is clustered with
the essays of Thoreau and the
other with the novels of Hawthorne. Also,
“My Brilliant Career” by M. Franklin is
clustered with the novels of Charlotte Bronte.
Apart from authorship, the
dendrograms reflect no other stylistic relation
between the works (such as grouping the works
by genre or by the nationality of the authors:
American / English / Australian).</p>
      <p>Spearman’s rank-order coefficient,
Goodman and Kruskal’s gamma and Kendall’s tau
produced the same dendrogram (modulo the
scale). Figure 1 shows the dendrogram for
Kendall’s tau. The dendrogram is perfect with
respect to authorship: all works are clustered according to their
author. The nationality of the authors is
not reflected in the dendrogram (authors
with the same nationality are not clustered
together).</p>
      <p>We performed a series of experiments to
test in which cases the nationality of the
authors can be revealed by a stylistic
similarity measure. If only British and Australian
writers are selected, Kendall’s tau
produces the dendrogram presented in Figure
2. As can be seen, the first two branches
correspond to the nationality of the authors:
British writers on the upper branch, Australian
writers on the lower branch. The same thing
happens when British and American writers are
selected. Again, the writers are clustered
according to their nationality: this time, the
British writers on the lower branch and the
American writers on the upper branch. But when the
subset of American and Australian writers is
clustered using Kendall’s tau, the
nationality of the writers is no longer reflected in the
family tree produced. The works of each
author are clustered together, but there are no
clear branches corresponding to the two
nationalities.</p>
    </sec>
    <sec id="sec-5">
      <title>Future Work</title>
      <p>In this paper we have compared a set of
measures with regard to their ability to reflect
stylistic similarity between texts. In future work it
would be interesting to compare these
measures with other possible similarity measures. If
the frequencies of different words in the texts
are treated as probability distributions
instead of random variables, specific measures
can be applied, such as Kullback-Leibler divergence
or cross entropy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Chung</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The psychological function of function words</article-title>
          . In K. Fiedler, ed.,
          <source>Social communication: Frontiers of social psychology</source>
          ,
          <fpage>343</fpage>
          −
          <lpage>359</lpage>
          . Psychology Press, New York.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>L.P.</given-names>
            <surname>Dinu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Dinu</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Authorship Identification of Romanian Texts with Controversial Paternity</article-title>
          .
          <source>Proc. LREC</source>
          <year>2008</year>
          , Marrakech, Morocco.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Duda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Hart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Stork</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <source>Pattern Classification</source>
          (2nd ed.). Wiley-Interscience.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>van Halteren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haverkort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Baayen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neijt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Tweedie</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>New machine learning methods demonstrate the existence of a human stylome</article-title>
          .
          <source>Journal of Quantitative Linguistics</source>
          ,
          <volume>12</volume>
          :
          <fpage>65</fpage>
          −
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Bonchek-Dokow</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Measuring differentiability: Unmasking pseudonymous authors</article-title>
          .
          <source>J. of Machine Learning Research</source>
          ,
          <volume>8</volume>
          ,
          <fpage>1261</fpage>
          −
          <lpage>1276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Labbe</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Labbe</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>A tool for literary studies: Intertextual distance and tree classification</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>21</volume>
          (
          <issue>3</issue>
          ):
          <fpage>311</fpage>
          −
          <lpage>326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Mosteller</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.L.</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Inference and Disputed Authorship: The Federalist</article-title>
          .
          <source>CSLI Publications</source>
          , Stanford.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.P.</given-names>
            <surname>Dinu</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Rank Distance as a Stylistic Similarity</article-title>
          .
          <source>Proceedings COLING</source>
          <year>2008</year>
          , Manchester, UK.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.P.</given-names>
            <surname>Dinu</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Comparing Statistical Similarity Measures for Stylistic Multivariate Analysis</article-title>
          .
          <source>Proceedings RANLP</source>
          <year>2009</year>
          , Borovets, Bulgaria.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Upton</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Cook</surname>
          </string-name>
          .
          <year>2008</year>
          . A Dictionary of Statistics. Oxford Univ. Press, Oxford.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>