<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Zeta &amp; Eta: An Exploration and Evaluation of Two Dispersion-based Measures of Distinctiveness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Keli Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Dudar</string-name>
          <email>dudar@uni-trier.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cora Rok</string-name>
          <email>rok@uni-trier.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christof Schöch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Trier</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>181</fpage>
      <lpage>194</lpage>
      <abstract>
<p>In Corpus Linguistics, numerous statistical measures have been adopted to analyze large amounts of textual data in a contrastive perspective, in order to extract characteristic or “distinctive” features. While the most widely-used keyness measures are based on word frequency, an increasing number of research papers have recently suggested dispersion-based measures as a better solution. These, however, are not new to Computational Literary Studies (CLS). In 2007, John Burrows introduced Zeta, a statistical measure that is mainly based on the degree of dispersion of a feature in a text corpus. In this paper, we introduce Eta, a new measure of distinctiveness that is based on the deviation of proportions suggested by Stefan Gries. By comparing Eta with Zeta, we demonstrate that both measures are able to identify relevant, interpretable distinctive words in a target corpus. Additionally, we make a first attempt to detect the key differences between these two measures by interpreting the top distinctive words.</p>
      </abstract>
      <kwd-group>
        <kwd>Computational Literary Studies</kwd>
        <kwd>measure of distinctiveness</kwd>
        <kwd>Zeta</kwd>
        <kwd>Eta</kwd>
        <kwd>dispersion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In Linguistics and Literary Studies, comparing groups of texts – e.g. belonging to different
literary genres or written for different audiences – is a fundamental procedure [see, e.g., 11].
In Corpus Linguistics, numerous statistical measures and instruments have been introduced
and adopted for investigating and analyzing large amounts of textual data in a contrastive
perspective [e.g. 20, 17, 15]. They are usually referred to as ’keyness measures’, as they
operate on a lexical level and are used for extracting “key” terms or phrases. We prefer the
term ’measures of distinctiveness’, as it better emphasizes that this kind of analysis is about
the extraction of characteristic words on the basis of a comparison [see 24].</p>
      <p>
        The most widespread keyness measures used in Corpus Linguistics are frequency-based – for
example, the chi-squared test or the log-likelihood-ratio test [25], implemented e.g. in AntConc
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recently, several research papers suggested dispersion-based measures as a better solution
for contrastive corpus analysis [e.g. 4, 8, 7]. Apart from that, the use of dispersion in the
search for important text features is not new to Computational Literary Studies (CLS). In
2007, John Burrows introduced Zeta, a keyness measure that is mainly based on the degree of
dispersion of a feature in a text corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Originally, it was used in the context of authorship
attribution, but it later came to be used also to solve other issues in CLS, including corpus
comparison [e.g. 3, 9, 23].
      </p>
      <p>There are several important studies that explore and evaluate frequency-based measures [e.g.
10, 18, 12, 19, 6], and some studies that compare dispersion-based measures to frequency-based
measures [e.g. 4, 8, 12]. However, as far as we know, no attempt has been made to compare
dispersion-based measures to each other. In our project “Zeta and company”1 we aim to
enhance the understanding of both frequency- and dispersion-based measures by implementing
them in a Python framework. Based on tests with literary texts, we evaluate which measures
perform best for different tasks and kinds of textual data. This article presents a pilot study
in our project; it aims to perform a statistical analysis and a qualitative evaluation of two
dispersion-based distinctiveness measures: (1) Eta, which is based on the deviation of proportions
(DP) developed by Stefan Gries; (2) Zeta, which was proposed by John Burrows.2</p>
      <p>Firstly, we will explain how Eta and Zeta are calculated. After that, using a collection of
160 novels of four different subgenres published in France in the 1980s, we will examine how
Eta behaves in contrast to Zeta and how their relationship changes when the segment length
varies. The following questions will be addressed: How useful is Eta as a basis for identifying
distinctive words in one text group compared to another text group? What are the differences
between Eta and Zeta and what results do they display?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Keyness analysis: from frequency to dispersion</title>
      <p>
        Despite the dominance of frequency-based keyness measures (e.g. chi-squared test, log-likelihood
ratio test), there are several alternative measures which consider other types of information like
the distribution of words (e.g. t-Test, Mann-Whitney-U-test) and their dispersion (e.g. Zeta).
A helpful overview of the frequency- and distribution-based measures can be found in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
In addition, Machine Learning-approaches (e.g. weights of a linear SVM) or entropy-related
approaches (e.g. Kullback-Leibler divergence, see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) can be used to identify distinctive words
in a target corpus.
      </p>
      <p>As already mentioned, the most widely used keyness measures in Corpus Linguistics are
frequency-based and they do not consider how the particular words are distributed within a
corpus. This means that a word can be marked as distinctive for the entire target corpus,
even if it just appears very frequently in a small number of texts. For illustration, Figure 1
presents the result of an analysis carried out using AntConc’s log-likelihood ratio test on our
working corpus (described below): keywords were extracted from a comparison of 40 French
science fiction novels (as the target corpus) with 120 French novels of other subgenres (as
the comparison corpus).3 It turns out that the top-ranked words are almost entirely proper
names. Each of them appears only in one novel of the target corpus, albeit very frequently,
and likely not at all in the comparison corpus and therefore cannot truly represent the entire
target corpus. In order to obtain more meaningful results, proper names should be pruned
from the list.</p>
      <p>
        To deal with this challenge, the dispersion of a feature, that is, the degree to which a feature is evenly distributed, should be considered as well (on dispersion, see [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; for the use of dispersion for keyness analysis, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). Gries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] gives a detailed overview of dispersion measures and proposes his own measure, called deviation of proportions (DP).
      </p>
      <sec id="sec-2-1">
        <title>2.1. Deviation of proportions (DP) and Zeta</title>
        <p>1 See: https://zeta-project.eu/en/.</p>
        <p>2 We have implemented both measures in our Python framework. See: https://github.com/Zeta-and-Company/pydistinto.</p>
        <p>3 AntConc 3.5.9 [see 1] was used with the following keyness parameters: Log-Likelihood (4-way) and a p-value cut-off of 0.001. The measure of effect size shown is DIFF.</p>
        <p>
          DP compares the difference between the observed and the expected relative frequency of a word in every single document of the corpus in order to quantify the dispersion of the word. DP is calculated as follows: for each corpus part (e.g., a file), compute s, which represents how much of the corpus this part constitutes (as a fraction of the whole corpus), and v, which represents how much of the word in question it contains (as a fraction of the word’s frequency). Then subtract all s-values from all v-values, take the absolute values of those differences, sum them up, and divide by two [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>DP = (∑i=1..n |si − vi|) / 2</p>
        <p>The theoretical range of DP values is between 0 and 1. A value of 0 reflects a perfectly even
dispersion, while a value of 1 represents a maximally uneven dispersion. This measure seems
to have several advantages compared to other dispersion measures. For example, it can handle
corpus parts of different lengths and it can distinguish between slight variations in distribution
without being overly sensitive. However, there is still a lack of empirical evidence supporting
the use of DP.</p>
        <p>As mentioned before, Burrows’ Zeta also considers dispersion and it is calculated by
comparing the document proportion (docP) of each feature in the target and in the comparison
corpus. At first, each text in each group is divided into segments of a certain length (segment
length is a key parameter of the measure). For each word w in the vocabulary, docP is
calculated by establishing the proportion of segments in which the word occurs at least once, so
docP ranges between 0 and 1.</p>
        <p>In order to find out whether a word is distinctive for the target corpus, the docP or devP4
values of the word in the target and the comparison corpus must be compared. Based on
docP and devP, two measures of distinctiveness can be defined. The Zeta score of a word w
is obtained by subtracting its docP in the comparison corpus from its docP in the target corpus [see
21]. Therefore, the theoretical range of the Zeta score is between -1 and 1. The words with
the highest Zeta scores are the most distinctive words of the target corpus. By analogy, and
using devP instead of docP as the measure of dispersion, a new measure of distinctiveness can
be defined, which we call Eta. It is obtained by subtracting the devP of a word w in the
comparison corpus from the devP of the same word in the target corpus. Contrary to docP, a
small devP of a word reflects a more even distribution of a feature in a corpus. It is therefore
expected that the devP of distinctive words in the target corpus is smaller than the devP of
these words in the comparison corpus. So the words with the lowest Eta scores are the most
distinctive words of the target corpus.5 As we can see, although Zeta and Eta are both
dispersion-based measures, they rely on different mathematical definitions of dispersion. As Eta
takes into account the ratio of document size to corpus size, which Zeta does not, we intend
to test whether or not Eta performs better than Zeta in detecting distinctive words.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Tests and results</title>
      <sec id="sec-3-1">
        <title>3.1. Corpus</title>
        <p>The corpus used in this study is a collection of 160 novels published in France between 1980
and 1989. 120 of them are lowbrow novels of three subgenres (40 novels for each subgenre):
sentimental novels, crime fiction and science fiction. The remaining 40 are highbrow novels.</p>
        <p>The corpus size is approximately nine million words. All texts have been lemmatized using
TreeTagger and the units of calculation are lemmas. As our goal was to extract distinctive
lemmas for each subgenre, we used a one-vs-rest strategy: the target corpus contains the 40 novels
of one subgenre and the comparison corpus contains the 120 novels of the other three subgenres.
This allowed us to focus on extracting distinctive features that are strongly related to the
unique characteristics of the target corpus.6</p>
        <p>4 We use devP instead of DP to better distinguish between the two terms.</p>
        <p>5 Only words which appear at least once in both corpora are considered here and in the following, because devP does not yield meaningful results otherwise.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Statistical observations</title>
        <p>The results of our comparative analysis are two lists of words which are ranked by their Zeta or
Eta scores, respectively. To compare the differences between Zeta and Eta, we measure the ranking
correlation between the two word lists using Spearman’s rank correlation. The stronger the
correlation, the less different these two word lists are. We performed tests on four comparison
groups, one per subgenre (e.g. sci-fi vs. non-sci-fi). The results of these four tests were almost the
same. For illustration, the results presented below are based on the comparison of sci-fi vs.
non-sci-fi.</p>
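        <p>A pure-Python sketch of this comparison (the scores are invented for illustration; in practice one would run e.g. scipy.stats.spearmanr over the full vocabulary), assuming no tied scores:</p>
        <preformat>
```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation of two equally long score lists (no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical Zeta and Eta scores for the same four words:
zeta_scores = [0.52, 0.44, -0.31, -0.25]
eta_scores = [-0.48, -0.41, 0.28, 0.22]
print(spearman_rho(zeta_scores, eta_scores))  # exactly reversed rankings: -1.0
```
        </preformat>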
        <p>As it is common to split novels into segments when applying Zeta, we also wanted to examine
the impact of the segment size on the results. So we did our tests using three segmentation
strategies: split all novels into (1) 5000-word segments, (2) 10000-word segments and (3) take
each novel as a segment without chunking. (The median length of the novels is about 46800
words.) For (1) and (2), segments shorter than 5000 or 10000 words were removed from the corpus.</p>
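        <p>A minimal sketch of this segmentation step (illustrative, not the exact preprocessing code):</p>
        <preformat>
```python
def segment(tokens, seg_len):
    """Split a token list into consecutive seg_len-token segments,
    dropping a trailing segment shorter than seg_len."""
    return [tokens[i:i + seg_len]
            for i in range(0, len(tokens) - seg_len + 1, seg_len)]

novel = ["token"] * 12_500
print([len(s) for s in segment(novel, 5_000)])  # the 2,500-token rest is dropped
```
        </preformat>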
        <p>Before comparing Zeta and Eta, we first compared the underlying values: the docP and the
devP. Again, Spearman’s correlation between the word rankings based on these two dispersion
measures was analyzed. In both corpora, the ranking correlations of the three tests with
different segment lengths are -1, -1, and -0.98, respectively. Figure 2 illustrates the relation
between docP and devP for all words in the target corpus.7 Each blue point represents a word
and the three graphs from left to right show the results of the tests on 5000-word segments,
10000-word segments and novel segments without chunking, respectively. Clearly, devP and
docP have a strong negative correlation, but the distribution of points in the three graphs from
left to right becomes increasingly dispersed. This means that the longer the novel segments
are, the less similar the word list rankings between devP and docP are.</p>
        <p>The comparison of Zeta and Eta leads to analogous results. The strong negative correlations
between the word rankings in the three tests are -0.99, -0.99, and -0.85, respectively. Each blue
point in Figure 3 represents a word and the x and y axes are the Zeta and Eta scores for each
word. The three graphs from left to right show the results of tests on 5000-word segments,
10000-word segments and entire novels, respectively. We can observe that the distribution of
points gradually becomes more dispersed. This means that the longer the novel segments are,
the less similar the Zeta and Eta scores are.</p>
        <p>Comparing the top distinctive words found by Zeta and Eta for each subgenre, we can often
observe the same words, but in a different order. To quantify these differences, we calculated
the token-based Jaccard similarity and NLTK’s edit distance between the top ten to 500 Zeta
and Eta words for different segment lengths.8 In Figure 4, the first and the second row are the
Jaccard similarity results and the NLTK’s edit distance results, respectively. The four columns
are the results of each of the four subgenres (from left to right: highbrow, crime, sci-fi and
sentimental) taken as a target corpus. The results of both Jaccard similarity and NLTK’s edit
distance show an increasing trend. The increase of the Jaccard similarity indicates that, as the
number of top words increases, the overlap of the Zeta and Eta word lists increases gradually.
Splitting novels into shorter segments leads to a greater overlap. In contrast to this result, the
increase of the NLTK’s edit distance shows that the words are ranked more differently as the
number of top words increases. These observations also support our earlier point: the
shorter the segments, the more words have the same or similar rank in both lists.</p>
        <p>6 The texts contained in the corpus are in-copyright texts that we are using in the framework of the “Text
and Data Mining Exception” defined in German copyright law (§ 60d UrhG), following the EU “Directive on
Copyright in the Digital Single Market”. While the corpus cannot be shared as it is, we plan to publish derived
features [see 22] that allow others to repeat our calculations.</p>
        <p>7 The scatter plot of docP and devP of words in the comparison corpus is almost the same as that in the
target corpus, so it is not displayed again.</p>
        <p>
          8 The Jaccard similarity [see 16] is the size of the intersection divided by the size of the union of two
word lists, without considering the ranking of words. Larger values indicate a greater overlap between the top
Zeta and Eta words. In contrast to the Jaccard similarity, NLTK’s edit distance (https://www.nltk.org/api/
nltk.metrics.html#nltk.metrics.distance.edit_distance, see Levenshtein edit distance, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]) takes the ranking of
words into consideration and counts the number of words that need to be substituted, inserted, or deleted to
transform one list into the other. Larger values indicate a greater difference between the Zeta and Eta word lists.
        </p>
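        <p>Both list-comparison metrics can be sketched in a few lines of Python (the toy lists are invented for illustration; pydistinto’s actual code and NLTK’s own implementation may differ in detail):</p>
        <preformat>
```python
def jaccard(list_a, list_b):
    """Jaccard similarity of two word lists, ignoring rank."""
    set_a, set_b = set(list_a), set(list_b)
    return len(set_a.intersection(set_b)) / len(set_a.union(set_b))

def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two word sequences: the number of
    substitutions, insertions, and deletions needed to turn one into the other."""
    m, n = len(seq_a), len(seq_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

top_zeta = ["robot", "space", "planet"]
top_eta = ["space", "robot", "diameter"]
print(jaccard(top_zeta, top_eta))        # 2 shared words of 4 distinct: 0.5
print(edit_distance(top_zeta, top_eta))  # 3
```
        </preformat>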
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Interpretation of the word lists</title>
        <p>overlapping words in the top 10 words. The top 30 Zeta words, however, contain more of the
highly ranked Eta words than vice versa.</p>
        <p>If we compare the two Zeta word lists in Figures 5 and 7, we notice that the Zeta words
do not change much with the increased segment length: There are three new words in the
top ten list, “level”, “base” and “hundred”, whereas the words “human”, “brain”, “planet”,
“universe”, “number”, “system” and “emit” can already be found in the first Zeta word list,
which indicates a certain consistency. The Eta word list in turn displays more new distinctive
words (“civilisation”, “level”, “complex”, “hundred”, “computer”, “function”, “electronic”).
However, the words of both lists can be assigned to the previously defined semantic categories
(Figure 8).</p>
        <p>Figure 9 shows the word lists of our third analysis, where a whole novel represents a segment.</p>
        <p>It is noticeable that there is no intersection between the top ten words of the two lists; only two of the
top ten words of each list can be found in the other list, within the top 25 (Eta rank 14:
“concept”; Eta rank 23: “nuclear” / Zeta rank 19: “chemical”; Zeta rank 14: “functioning”).</p>
        <p>While the Zeta list contains words like “humanity”, “civilization”, “space”, “orbit”, “earthly”,
“computer”, “electronic” and “robot”, which seem to fit into the previously established
semantic categories and represent more general terms from everyday language, the Eta words like
“diameter” or “vertebral” are more specific and sophisticated and open up further semantic
categories from the fields of science (Figure 10). This tendency of Eta to extract more new specific
words becomes even stronger when the segment length increases up to novel length,
while the Zeta words stay more general. As Eta words seem more specific, our assumption is
that they should be less frequent than the Zeta words in a much larger corpus. To verify this,
we checked the frequency of the top Zeta and Eta words in the French Wikipedia.9 Figure 11
shows that the top (10, 50 and 100) Zeta words are indeed more frequent and therefore less
specific than the Eta words. This effect is stronger, the longer the segments are.</p>
        <p>9 The frequencies of words in Wikipedia are obtained from http://redac.univ-tlse2.fr/corpora/wikipedia_en.html. If a word does not exist in the frequency table, its frequency is set to 0.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and future work</title>
      <p>This paper presents a comparison of two measures of distinctiveness, Zeta and Eta. The results
show that on the statistical level, both of them have a very strong negative correlation, despite
their different basis for calculation. Another observation is that the correlation between Zeta
and Eta is stronger when novels are divided into shorter segments. We obtain the weakest
correlation when novels are not split into segments at all. This correlation is also reflected in
the word lists: the shorter the segments, the more similar the word lists and vice versa. The
calculation of the Jaccard similarity allowed us to observe the following trend: The Jaccard
similarity decreases when the segment length increases.</p>
      <p>The observed similarities concern word rankings as well: We observe not only (almost)
the same words in the top ten ranking when calculating with small segments, but the
word rankings are also almost the same in both word lists. The calculation of the NLTK’s edit
distance between word lists verified our observation: The distance between the word-rankings
increases when the segment length increases.</p>
      <p>A qualitative interpretation of the word lists confirmed the statistical observations. Both
measures are able to identify relevant, interpretable distinctive words in a target corpus. There
is no need to remove stop words or to prune proper names: Both dispersion-based measures mark
content words as distinctive. It seems that when the segment length increases, the Zeta words
remain content-related and more general, while the Eta words also remain content-related, but
become more specific. We are going to investigate this phenomenon in further tests.</p>
      <p>In the future, we plan to deepen our understanding of distinctiveness measures even further.
Our next steps are to test the measures on larger and more varied corpora and to run more
experiments with segment length. We are also planning to include other distinctiveness measures
in our framework, such as Kullback-Leibler Divergence, Wilcoxon signed-rank test or T-test.
One point to emphasize is that the qualitative interpretation of the word lists may seem very
subjective and it looks more like an exploration than an evaluation. This is inevitable, because
as far as we know, a widely accepted robust method for a qualitative evaluation in this area
is still lacking. Therefore, we will work on developing new evaluation strategies for these
measures, in order to explore the advantages and disadvantages of each of these measures and to
find out for which purpose they should be used.</p>
    </sec>
    <sec id="sec-5">
      <title>Author contributions</title>
      <p>All authors contributed to the conceptualization of the research, investigation, formal analysis,
writing the original draft and editing and reviewing the text. Specific additional contributions:
KD contributed to project administration, software development, visualisation and
methodology. JD contributed to data curation and software development. CR contributed to validation.
CS contributed to data curation, software development, funding acquisition and supervision.
Author order is alphabetical. All authors gave final approval for publication and agree to be
held accountable for the work performed therein.10</p>
      <p>10 See https://casrai.org/credit.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Anthony</surname>
          </string-name>
          . “
          <article-title>AntConc: Design and development of a freeware corpus analysis toolkit for the technical writing classroom”</article-title>
          .
          <source>In: 2005</source>
          , pp.
          <fpage>729</fpage>
          -
          <lpage>737</lpage>
          . doi: 10.1109/ipcc.2005.1494244.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Burrows</surname>
          </string-name>
          . “
          <article-title>All the Way Through: Testing for Authorship in Different Frequency Strata”</article-title>
          .
          <source>In: Literary and Linguistic Computing 22.1</source>
          (
          <issue>2007</issue>
          ), pp.
          <fpage>27</fpage>
          -
          <lpage>47</lpage>
          . doi: 10.1093/llc/fqi067. url: http://llc.oxfordjournals.org/content/22/1/27.abstract.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Craig</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Kinney</surname>
          </string-name>
          , eds. Shakespeare, Computers, and the Mystery of Authorship. 1st ed. Cambridge University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Egbert</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Biber</surname>
          </string-name>
          . “
          <article-title>Incorporating text dispersion into keyword analyses”</article-title>
          .
          <source>In: Corpora 14.1</source>
          (
          <issue>2019</issue>
          ), pp.
          <fpage>77</fpage>
          -
          <lpage>104</lpage>
          . doi: 10.3366/cor.2019.0162. url: https://www.euppublishing.com/doi/abs/10.3366/cor.2019.0162.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fankhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Knappen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Teich</surname>
          </string-name>
          . “
          <article-title>Exploring and Visualizing Variation in Language Resources”</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          . Ed. by
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Loftsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          . Reykjavik, Iceland:
          <source>European Language Resources Association (ELRA)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gabrielatos</surname>
          </string-name>
          . “
          <article-title>Keyness Analysis: nature, metrics and techniques”</article-title>
          .
          <source>In: Corpus Approaches to Discourse: A Critical Review</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>225</fpage>
          -
          <lpage>258</lpage>
          . url: https://research.edgehill.ac.uk/en/publications/keyness-analysis-nature-metrics-and-techniques-2.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gries</surname>
          </string-name>
          . “
          <article-title>A new approach to (key) keywords analysis: Using frequency, and now also dispersion”</article-title>
          .
          <source>In: Research in Corpus Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          . doi: 10.32714/ricl.09.02.02.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Gries</surname>
          </string-name>
          . “
          <article-title>Dispersions and adjusted frequencies in corpora”</article-title>
          .
          <source>In: International Journal of Corpus Linguistics 13.4 (2008)</source>
          . doi: 10.1075/ijcl.13.4.02gri.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Hoover</surname>
          </string-name>
          . “
          <article-title>Teasing out Authorship and Style with t-tests and Zeta”</article-title>
          . In: Digital Humanities Conference. London,
          <year>2010</year>
          . url: http://dh2010.cch.kcl.ac.uk/academicprogramme/abstracts/papers/html/ab-658.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          . “
          <article-title>Comparing word frequencies across corpora: Why chi-square doesn't work, and an improved LOB-Brown comparison”</article-title>
          .
          <source>In: ALLC-ACH Conference</source>
          .
          <year>1996</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Klimek</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Müller</surname>
          </string-name>
          . “
          <article-title>Vergleich als Methode? Zur Empirisierung eines philologischen Verfahrens im Zeitalter der Digital Humanities [Abstract]”</article-title>
          .
          <source>In: JLT Articles 9.1</source>
          (
          <year>2015</year>
          ). url: http://www.jltonline.de/index.php/articles/article/view/758.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lijffijt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nevalainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Säily</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papapetrou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Puolamäki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          . “
          <article-title>Significance testing of word frequencies in corpora”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 31.2</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>374</fpage>
          -
          <lpage>397</lpage>
          . doi: 10.1093/llc/fqu064. url: http://dsh.oxfordjournals.org/lookup/doi/10.1093/llc/fqu064.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Lyne</surname>
          </string-name>
          . “
          <article-title>Dispersion”</article-title>
          .
          <source>In: The Vocabulary of French Business Correspondence: Word Frequencies, Collocations and Problems of Lexicometric Method</source>
          . Paris: Slatkine,
          <year>1985</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          “
          <article-title>A guided tour to approximate string matching”</article-title>
          .
          <source>In: ACM Computing Surveys 33.1</source>
          (
          <year>2001</year>
          ), pp.
          <fpage>31</fpage>
          -
          <lpage>88</lpage>
          . doi: 10.1145/375360.375365. url: https://dl.acm.org/doi/10.1145/375360.375365.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Groom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Handelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          . “
          <article-title>Gender differences in language use: An analysis of 14,000 text samples”</article-title>
          .
          <source>In: Discourse Processes 45.3</source>
          (
          <year>2008</year>
          ), pp.
          <fpage>211</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Niwattanakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singthongchai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Naenudorn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Wanapu</surname>
          </string-name>
          . “
          <article-title>Using of Jaccard coefficient for keywords similarity”</article-title>
          .
          <source>In: Proceedings of the international multiconference of engineers and computer scientists</source>
          . Vol.
          <volume>1</volume>
          .
          <year>2013</year>
          , pp.
          <fpage>380</fpage>
          -
          <lpage>384</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Oakes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Farrow</surname>
          </string-name>
          . “
          <article-title>Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries”</article-title>
          .
          <source>In: Literary and Linguistic Computing 22.1</source>
          (
          <year>2007</year>
          ), pp.
          <fpage>85</fpage>
          -
          <lpage>99</lpage>
          . doi: 10.1093/llc/fql044. url: https://academic.oup.com/dsh/article/22/1/85/1025876.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paquot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bestgen</surname>
          </string-name>
          . “
          <article-title>Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction”</article-title>
          . In: Corpora: Pragmatics and Discourse. Ed. by
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Jucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schreier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hundt</surname>
          </string-name>
          . Brill | Rodopi,
          <year>2009</year>
          . doi: 10.1163/9789042029101_014. url: https://brill.com/view/book/edcoll/9789042029101/B9789042029101-s014.xml.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pojanapunya</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Todd</surname>
          </string-name>
          . “
          <article-title>Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis”</article-title>
          .
          <source>In: Corpus Linguistics and Linguistic Theory</source>
          <volume>14</volume>
          .1 (
          <year>2018</year>
          ), pp.
          <fpage>133</fpage>
          -
          <lpage>167</lpage>
          . doi: 10.1515/cllt-2015-0030. url: https://www.degruyter.com/view/journals/cllt/14/1/article-p133.xml.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Leech</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hodges</surname>
          </string-name>
          . “
          <article-title>Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus”</article-title>
          .
          <source>In: International Journal of Corpus Linguistics 2.1</source>
          (
          <year>1997</year>
          ), pp.
          <fpage>133</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schöch</surname>
          </string-name>
          . “
          <article-title>Zeta für die kontrastive Analyse literarischer Texte. Theorie, Implementierung, Fallstudie”</article-title>
          .
          <source>In: Quantitative Ansätze in den Literatur- und Geisteswissenschaften. Systematische und historische Perspektiven</source>
          . Ed. by
          <string-name>
            <given-names>T.</given-names>
            <surname>Bernhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Willand</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          . Berlin: de Gruyter,
          <year>2018</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>94</lpage>
          . url: https://www.degruyter.com/view/books/9783110523300/9783110523300-004/9783110523300-004.xml.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schöch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Döhl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Trilcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Leinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jannidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hinzmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Röpke</surname>
          </string-name>
          . “
          <article-title>Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen”</article-title>
          .
          <source>In: Zeitschrift für digitale Geisteswissenschaften (ZfdG) 5</source>
          (
          <year>2020</year>
          ). doi: 10.17175/2020_006. url: http://www.zfdg.de/2020_006.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schöch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlör</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zehe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gebhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          . “
          <article-title>Burrows' Zeta: Exploring and Evaluating Variants and Parameters”</article-title>
          .
          <source>In: Book of Abstracts of the Digital Humanities Conference. Mexico City: ADHO</source>
          ,
          <year>2018</year>
          . url: https://dh2018.adho.org/burrows-zeta-exploring-and-evaluating-variants-and-parameters/.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schröter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dudar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rok</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schöch</surname>
          </string-name>
          . “
          <article-title>From Keyness to Distinctiveness - Triangulation and Evaluation in Computational Literary Studies”</article-title>
          . In:
          <source>Journal of Literary Theory (JLT)</source>
          ().
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>