<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computational Humanities Research Conference, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Predicting Canonization: Comparing Canonization Scores Based on Text-Extrinsic and -Intrinsic Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Judith Brottrager</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annina Stahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arda Arslan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ETH Zurich, Social Networks Lab</institution>
          ,
          <addr-line>Weinbergstraße 109, 8092 Zurich</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technical University of Darmstadt, Institute of Linguistics and Literary Studies</institution>
          ,
          <addr-line>Dolivostraße 15, 64293 Darmstadt</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <issue>4</issue>
      <fpage>7</fpage>
      <lpage>19</lpage>
      <abstract>
<p>The majority of literary texts ever written are hardly known, read, or studied today, and belong to the so-called “Great Unread”. Theories of canonization predominantly focus on sociocultural processes of selection which culminate in the formation of a canon, but say little about how the texts themselves contribute to canonization. In this paper, we propose an operationalization for canonization, which is then used to build a classifier that predicts a canonization score for a text by considering text-intrinsic features only. Working on a historical corpus of English and German texts, which includes both canonical and “unread” works, we show that a canonization score based on text-inherent features correlates only weakly with a canonization score based on text-extrinsic features.</p>
      </abstract>
      <kwd-group>
        <kwd>literary texts</kwd>
        <kwd>canonization</kwd>
        <kwd>text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A major promise of computational literary studies has been the inclusion of the so-called “Great
Unread”, i.e. those texts that have been previously underrepresented in literary history and
which are hardly known, read, or studied today. Large-scale analyses have shown, however,
that the argumentative strength of “distant reading” approaches [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ] does not lie in
including all of what Algee-Hewitt et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] call the “archive”, i.e. all published texts preserved in
libraries and archives, but in contextualizing the available sample and the population in
question [
        <xref ref-type="bibr" rid="ref18">19</xref>
        ]. One way of contextualizing the relationship between the “Great Unread”—in other
words non-canonical texts—and highly canonical works is to explicitly address their degree of
canonization.
      </p>
      <p>
        The canon as such is not easily definable or even palpable: De- and recanonizations of
texts prove that their canonical status is not fixed, but changes over time. Additionally, the
criteria for evaluating literature vary between genres [
        <xref ref-type="bibr" rid="ref15 ref8">16, 8</xref>
        ] and over time [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and can thus
not be universally operationalized. Recent contributions in the field of canon studies stress the
complex combination of selection and interpretation processes which are influenced by both
literary and non-literary factors [
        <xref ref-type="bibr" rid="ref15">16</xref>
        ]. Canonization is seen as the result of an interplay between
sociocultural, discursive, and institutional powers, which is only partially understood [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ].
      </p>
      <p>
        These theories of canonization usually neglect the aspect of literary quality and there is
almost no research on the textual aspects of literary judgment [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Winko [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ] points to
the fact that even though the question whether canonical texts share certain textual features
that differentiate them from others is hardly addressed, literary scholars often implicitly name
certain qualities of texts as the reason for their status in the canon. Similarly, computational
literary approaches have often not discussed canonization in direct relationship with textual
features, but have either been limited to metadata [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or have addressed related concepts of
distinction in the literary market, as, for example, popularity (see Section 2).
      </p>
      <p>
        Our goal is to examine the relationship between a canonization score based on text-extrinsic
features, as, for example, available editions and references in secondary literature, and a
corresponding score based on text-intrinsic features only. For this, each text in our corpus comprising
both canonized and non-canonized texts is assigned a score which reflects the likelihood of the
text belonging to the canon, which is also used for the evaluation of our models. If the
classification based on text-inherent features is successful, then we can assume that the canonization
of texts is, as Winko [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ] assumes, to a certain degree linked to textual qualities.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous Work</title>
      <p>
        As mentioned above, existing quantitative studies either investigate canonization in metadata
analyses by modeling indicators of prestige and popularity [
        <xref ref-type="bibr" rid="ref1 ref19">20, 1</xref>
        ] or by looking at related
concepts of distinction. For example, Cranenburgh, Dalen-Oskam and Zundert [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] use textual
features to predict the literariness of modern Dutch novels, which was previously determined
in a comprehensive survey of readers. They find that a model that combines document
embeddings and topic modeling best predicts literariness. Their results further indicate that novels
that are perceived as especially literary tend to deviate from the norm by having higher
semantic complexity, and that these literary novels use certain words and topics more frequently.
Working with the same dataset, Jautze, Cranenburgh and Koolen [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use topic modeling [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and show that novels that are perceived as highly literary tend to use a more diverse range of
topics. Ashok, Feng and Choi [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] predict literary success, measured by download counts from
Project Gutenberg, discovering that some style characteristics only explain literary success for
some genres, while others are universal indicators, and find some evidence that readability and
literary success are negatively correlated.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Corpus</title>
      <p>
        Our corpus comprises 1,153 novels and narratives in English and German (606 and 547,
respectively), covering the Long 18th and the Long 19th century of British and German language
literary history. This time span from 1688–1914 avoids culture-specific temporal limits and
encompasses great changes in literary production and consumption. We expect this comparative
bilingual approach to be especially productive because canonization processes differ
significantly between these two literary traditions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. During corpus preparation, we systematically
adapted an approach proposed by Algee-Hewitt and McGurl [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which aims at achieving
representativeness for literary corpora by moving from a “found corpus” to a “made” list of texts,
working with best-of and bestseller lists and expert surveys. By doing so, Algee-Hewitt and
McGurl [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] combine three different tiers of literary production: a more exclusive canon,
popular and financially successful texts, and a more diverse group of works added at the
suggestion of experts in Postcolonial and Feminist Studies. Analogously, we identified secondary
sources, narrative literary histories, anthologies, and more specialised academic monographs
that represent these tiers of literary production and used them as bibliographies for our corpus.1
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Canonization Score</title>
      <p>In order to be able to take canonization into account, we had to find a reliable
operationalization. As neither the canon nor canonization processes are fixed or agreed upon, we have
decided to implement a canonization score which reflects the likelihood of a text belonging to
the canon. By defining the canonization score as a likelihood, we account for the flexibility of
canon formations.</p>
      <p>
        Based on theoretical background provided by Heydebrand and Winko [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we formalized the
following characteristics of a canonized text: A text is more likely to be canonized if (1) there is
an edition of the complete or collected works of the author, (2) student editions of the text are
available, and if it is mentioned in (3) exclusive narrative literary histories and anthologies and
(4) other academic literature.2 For the computation itself, we made use of the bibliographical
information we had gathered during the corpus compilation (for example, the number of times
a specific work was mentioned in exclusive literary histories) and additional data taken from
the respective national bibliographies and selected publishing houses.
      </p>
      <p>Building on the conceptualization of the canonization score as a likelihood, we then identified
minimum and maximum values, i.e. those texts that are extremely unlikely or extremely likely
to be considered canonized. For the minima, we again used the bibliographic information
collected during the corpus compilation to identify those texts that were mentioned in only one
highly specialized secondary source. These specialized sources, which deal with literature by
marginalised authors and genres, reference texts that are likely to be known and read only by
an expert audience, which makes it extremely unlikely for them to be canonized. In contrast
to the mass of completely forgotten texts, however, they are at least in some form remembered
and available. To represent this difference in the score, we set the minimum score to 0.05.
The maxima were defined with the help of university reading lists: the more often a text was
mentioned on different reading lists, the higher its score. In our final model, works referenced
on more than 60% of reading lists were attributed a score of 1.0, those mentioned on between
30% and 60% a score of 0.8, and all others with at least one mention a score of 0.6. Following this
determination of our training data, we trained a logistic regression model (which again reflects
the score’s conceptualization as a likelihood) on these texts and their respective canonization
features, i.e. the four characteristics mentioned above plus a count representing the number of
references to the text in highly specialized secondary sources.</p>
      <p>1 As expected, not all texts listed were already digitized, and we retro-digitized some of the missing texts
ourselves, focusing on adding more diversity to the corpus (by adding not yet represented authors, female
authors, authors from geographical peripheries, and niche genres).</p>
      <p>
        2 Algee-Hewitt et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] similarly formalize different canon notions by relying on entries in the Dictionary of
National Biography, the MLA Bibliography, and Stanford PhD exam lists.
      </p>
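The score assignment described above can be summarized as a small scoring rule. This is only a sketch; the function and argument names are ours, not the paper's, and the non-extreme cases receive no fixed score because their scores are predicted by the logistic regression model.

```python
def canonization_score(reading_list_share, specialized_only):
    """Assign training scores for the extreme cases (hypothetical helper;
    thresholds follow the paper's description in Section 4)."""
    if specialized_only:
        # mentioned in only one highly specialized secondary source
        return 0.05
    if reading_list_share > 0.60:    # on more than 60% of reading lists
        return 1.0
    if reading_list_share >= 0.30:   # on between 30% and 60%
        return 0.8
    if reading_list_share > 0.0:     # at least one mention
        return 0.6
    return None                      # not an extreme case; score is predicted
```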
      <p>All features used have a significant impact on the discrimination between canonized and
non-canonized texts. Table 1 summarizes how a text’s odds of being canonized change per
unit. For binary characteristics, as, for example, an existing student edition, this means that
the odds of being canonized are multiplied by 2.04 if a text is available in such an edition. For counts,
as, for example, the mentions in exclusive literary histories, each reference multiplies the odds by
1.98.</p>
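The per-unit odds changes reported in Table 1 are the exponentiated coefficients of the logistic regression. A minimal sketch on synthetic data (the feature columns and fitted values here are invented stand-ins; the paper's actual features are the four characteristics plus the specialized-source count):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic illustration: two binary features (e.g. student edition,
# collected-works edition) and a binary canonization label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 1).astype(int)

model = LogisticRegression().fit(X, y)
# exp(coefficient) = odds ratio: how the odds of being canonized
# are multiplied per unit increase of the feature.
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)
```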
      <p>The resulting model was then used for the prediction of the canonization scores for the
entire corpus. Figure 1 shows the scores for all texts and depicts how they are determined
by the existence of complete/collected works or student editions and the number of references
in exclusive literary histories; the points are transparent so that clusterings and overlaps are
identifiable. Overall, the upper end of the scale is dominated by texts by established and
well-researched authors (as they are published in complete/collected works) that are also likely to
be taught in schools and at universities (as they are published as student editions). The lower
range of the scale is dominated by texts which are not part of a narrowly defined literary history
(as they are not mentioned in exclusive literary histories) and whose authors are under-studied.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Methods</title>
      <sec id="sec-5-1">
        <title>5.1. Approach</title>
        <p>Having established a measure of the degree of canonization for each text in our corpus, we
used these scores as the ground truth for a model that predicts canonization scores solely from
textual features.</p>
        <p>We extracted a range of features from our texts that cover diferent aspects of style and
content. In a preparatory step, we converted all texts to lowercase and replaced
German-specific characters. In order to increase the number of data points on which we would train the
model, we split the documents into chunks of 200 sentences using spaCy, a library for natural
language processing, for sentence tokenization. Features were then calculated for either chunks
or full documents, depending on the nature of the feature. In section 5.2, feature extraction is
explained in more detail.</p>
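The chunking step can be sketched as follows. The paper uses spaCy for sentence tokenization; to keep this sketch dependency-free we substitute a simple regex splitter, so sentence boundaries will differ slightly from spaCy's output.

```python
import re

def chunk_sentences(text, chunk_size=200):
    """Split a document into chunks of `chunk_size` sentences
    (sketch of the preprocessing described above; regex splitter
    stands in for spaCy's sentence tokenizer)."""
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sents[i:i + chunk_size])
            for i in range(0, len(sents), chunk_size)]
```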
        <p>
          Using Support Vector Regression (SVR) as the regression model, we tested several
combinations of features and dimensionality reduction techniques for each language separately with a
10-fold cross-validation. All works of an author were part of the same fold in order to avoid
overfitting to an author’s characteristics. We selected the model with the highest Pearson
correlation coefficient (Pearson’s r) between the canonization scores and the predicted scores.
The p-value of the correlation coefficient was calculated by taking the harmonic mean of the
p-values of the folds [
          <xref ref-type="bibr" rid="ref20">21</xref>
          ]. We included chunk- and document-level features both separately and
in combination, either by adding the document-level features to each chunk, or by taking the
average of all chunks per document. For dimensionality reduction, we tried PCA, including
enough components so that 95% of the variance was explained, as well as SelectKBest from
scikit-learn with either mutual information regression or F-regression as the scoring function,
and retained 10% of features.
        </p>
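The evaluation setup described above (author-grouped 10-fold cross-validation, PCA retaining 95% of the variance, SVR, Pearson's r per fold, and a harmonic-mean p-value) can be sketched on synthetic data. All features, scores, and author labels below are invented stand-ins for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr, hmean
from sklearn.decomposition import PCA
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))                 # toy textual features
y = X[:, 0] * 0.5 + rng.normal(0, 1, 300)      # toy canonization scores
authors = rng.integers(0, 30, 300)             # all works of an author share a fold

rs, ps = [], []
for train, test in GroupKFold(n_splits=10).split(X, y, groups=authors):
    model = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVR())
    model.fit(X[train], y[train])
    r, p = pearsonr(y[test], model.predict(X[test]))
    rs.append(r)
    ps.append(p)

# Mean Pearson's r across folds; harmonic mean combines the fold p-values.
print(np.mean(rs), hmean(ps))
```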
        <p>In addition to running the classification with all texts, we conducted two experiments with
the texts that served as training data for the canonization scores. In our first approach,
we trained the model on all texts and validated it only on these non-canonized and
highly canonized cases, and in the second approach, we both trained and validated the model on the
extreme cases.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Features</title>
        <p>
          We used a wide range of established features from micro- to macro-textual levels, covering
character-based, lexical, and semantic characteristics.3 Starting on the level of individual
characters, the feature set comprises the ratio of various special characters, as, for example,
punctuation marks. On the lexical level, we have included the tf-idf4 of a word if it occurs
in at least 10% of documents and is among the 30 words with the highest tf-idf for at least
one document, as well as n-gram-based features, such as the 100 most frequent uni-, bi-, and
trigrams, and the ratio of all unique uni-, bi-, and trigrams and their entropy. The
type-token ratio and the ratio of stopwords are proxies for a chunk’s lexical diversity, the Flesch
reading ease score [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for its readability. Additionally, the average word and paragraph length,
the text length of a chunk of 200 sentences, and the average and maximum number of words
per sentence are used as features. We also created a doc2vec embedding [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] for each chunk,
treating chunks as separate documents.
        </p>
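A few of the chunk-level lexical features named above can be sketched as follows. This is a minimal illustration of three of the measures (type-token ratio, stopword ratio, unigram entropy), not the paper's full feature set, and the function name is ours.

```python
import math
from collections import Counter

def lexical_features(tokens, stopwords):
    """Compute a small subset of the chunk-level lexical features
    described above from a list of lowercased tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    probs = [c / n for c in counts.values()]
    return {
        "type_token_ratio": len(counts) / n,                       # lexical diversity
        "stopword_ratio": sum(t in stopwords for t in tokens) / n,
        "unigram_entropy": -sum(p * math.log2(p) for p in probs),  # in bits
    }
```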
        <p>
          For the modeling of semantic complexity, we implemented four variations of distances
between chunks, which were introduced by Cranenburgh, Dalen-Oskam and Zundert [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The document vectors obtained by embedding techniques are interpreted as points in a vector space
so that the Euclidean distance can be calculated between them. Representing in-text semantic
similarity,5 semantic coherence,6 semantic similarity to other texts,7 and semantic overlap with
other texts,8 these variations enable a diversified look at text similarities. We calculated all four
semantic complexity measures with both doc2vec and Sentence-BERT (SBERT) embeddings,
the latter being a modification of BERT that better captures semantic similarity between sentences
[
          <xref ref-type="bibr" rid="ref14">15</xref>
          ].
        </p>
        <p>3 An overview of all features used can be found in the appendix.</p>
        <p>4 Term frequency-inverse document frequency.</p>
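Under the definitions given in footnotes 5 and 6, the first two semantic complexity measures can be sketched as below, assuming an (n_chunks, dim) array of chunk embeddings (doc2vec or SBERT vectors); the function names follow the footnote terminology.

```python
import numpy as np

def intra_textual_variance(chunk_vecs):
    """Sum of Euclidean distances between each chunk embedding and the
    centroid of all chunks; high values mean the chunks are
    semantically dissimilar to each other."""
    centroid = chunk_vecs.mean(axis=0)
    return float(np.linalg.norm(chunk_vecs - centroid, axis=1).sum())

def stepwise_distance(chunk_vecs):
    """Sum of Euclidean distances between consecutive chunks,
    indicating how rapidly a text's semantic content changes."""
    return float(np.linalg.norm(np.diff(chunk_vecs, axis=0), axis=1).sum())
```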
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>For both languages, using PCA for dimensionality reduction yielded the highest correlation
coefficients in the cross-validation.9 For the English texts, the combination of text-level features
and averages across all chunks performed best with a Pearson’s r of 0.242, and chunk-level
features delivered the best results on the German texts with an r of 0.285. Table 2 shows the
results for using SVR, PCA, and different feature levels, and Figure 2 shows the canonization
scores versus the predicted scores. Limiting the training and/or test data to only non-canonized
and highly canonized texts produced correlation coefficients that were similar to those from
the full dataset. This can be seen as an indication that the canonization scores inferred for the
texts between the extreme cases are reliable.</p>
      <p>These weak correlations between the predicted and the actual canonization scores lead to the
conclusion that a model of canonization based on text-extrinsic features and a model based
on text-intrinsic features are only weakly interconnected. A closer look at some examples
shows, however, that some interesting systematic shifts between the models can be observed:
While for both the English and the German corpus, texts with the 10% highest canonization
scores were on average published during the first half and middle of the 19th century (1853
and 1834, respectively), texts with the 10% highest predicted scores center around 1873 in
the English and 1805 in the German corpus. This can be seen as an indication that what is
actually captured is the closeness to central literary periods: Texts written in the Victorian
Age dominate the highest predicted scores for the English corpus; texts from the Goethezeit
(1770-1830) those for the German corpus.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>Building upon the theoretical framework of canonization theories and previous studies focusing
on indicators of literary distinction, our approach offers a quantitative operationalization of
canonization and some initial analyses of the relationship between a metadata-based
conceptualization of canonization and text-inherent features. Overall, our results indicate that this
relationship is very limited. There are, however, some trends on a smaller scale that call for
more detailed analyses.</p>
      <p>5 Intra-textual variance is calculated by summing over the distances between the document chunks and the
centroid, which is obtained by averaging over the chunk vectors. A high intra-textual variance means that the
chunks are semantically very dissimilar to each other.</p>
      <p>6 Stepwise distance is similar to intra-textual variance, but instead of calculating the distance of chunks from
the centroid, the distance between consecutive chunks is calculated, indicating how rapidly a text’s semantic
content changes.</p>
      <p>7 Inter-textual variance is measured with the outlier score, which is the distance of a text, represented as the
centroid of its constituting chunks, to its nearest neighbour.</p>
      <p>8 The overlap score measures which fraction of the k nearest neighbours of a text’s centroid are chunks that
belong to that very text, with k being the number of chunks in the text.</p>
      <p>9 In order to allow for an evaluation of feature contribution, we also included SelectKBest from scikit-learn
in the cross-validation, which assigns a score to each feature using a scoring function and then keeps only the
k features with the highest score. However, it performed worse than PCA in terms of correlation, so we chose
a model with PCA instead.</p>
      <p>In the next stage of our project, we will focus on those texts whose text-extrinsic and -intrinsic
canonization scores differ widely. By doing so, we will be able to further investigate the patterns
of deviations. This step will also include an evaluation of the implemented features and an
analysis of their individual impact on the predictions.</p>
      <p>
        Moreover, dividing the texts into cohorts based on the publication date will help us explore
the difference between similarity to other canonized texts and canonization itself, as this would
level out the dominance of certain periods in the canon. These cohorts would also allow for a
more theory-based description of canonization processes, because, as Heydebrand and Winko
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have shown, value judgments and evaluative systems are highly adaptive and flexible.
      </p>
      <p>
        On a methodological level, our approach could be improved by adding features that require
more complex language processing, as, for example, Ashok, Feng and Choi [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have done by
including the distribution of part-of-speech tags, syntactic production rules, or sentiments.
Finally, as we are working on historical texts from 1688-1914, the language models would have
to be trained on or adapted for historical language.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is part of ’Relating the Unread. Network Models in Literary History’, a project
supported by the German Research Foundation (DFG) through the priority programme SPP
2207 Computational Literary Studies (CLS). Special thanks to Ulrik Brandes and Thomas
Weitin for their feedback and support and to our anonymous reviewers for their invaluable
input and suggestions.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Overview of Features</title>
      <p>[Result figures, panels: (a) English, all documents, book + average chunk features; (b) German, all documents, chunk features; (c) English, full training and reduced test data, book + average chunk features; (d) German, full training and reduced test data, chunk features; (e) English, reduced training and test data, book + average chunk features; (f) German, reduced training and test data, chunk features.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Algee-Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Allison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gemma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heuser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moretti</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Walser</surname>
          </string-name>
          . “Canon/Archive.
          <article-title>Large-scale Dynamics in the Literary Field”</article-title>
          .
          <source>In: Pamphlets of the Stanford Literary Lab</source>
          <volume>11</volume>
          (
          <year>2016</year>
          ). url: https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Algee-Hewitt</surname>
          </string-name>
          and
          <string-name>
            <surname>M. McGurl.</surname>
          </string-name>
          “
          <article-title>Between Canon and Corpus: Six Perspectives on 20thCentury Novels”</article-title>
          .
          <source>In: Pamphlets of the Stanford Literary Lab. Pamphlets of the Stanford Literary Lab</source>
          <volume>8</volume>
          (
          <year>2015</year>
          ). url: http://litlab.stanford.edu/LiteraryLabPamphlet8.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ashok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          . “
          <article-title>Success with style: Using writing style to predict the success of novels”</article-title>
          .
          <source>In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>
          . Association for Computational Linguistics, Seattle, Washington,
          <year>2013</year>
          , pp.
          <fpage>1753</fpage>
          -
          <lpage>1764</lpage>
          . url: https://aclanthology.org/D13-1181.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bentz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alikaniotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cysouw</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ferrer-i-Cancho</surname>
          </string-name>
          . “
          <article-title>The Entropy of Words: Learnability and Expressivity across More than 1000 Languages”</article-title>
          .
          <source>In: Entropy 19.6</source>
          (
          <year>2017</year>
          ), p.
          <fpage>275</fpage>
          . doi: 10.3390/e19060275.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          and
          <string-name>
            <surname>M. I. Jordan.</surname>
          </string-name>
          “
          <article-title>Latent Dirichlet Allocation”</article-title>
          .
          <source>In: Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          ), pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          . url: https://www.jmlr.org/papers/ volume3/blei03a/blei03a.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>A. van Cranenburgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            van
            <surname>Dalen-Oskam</surname>
          </string-name>
          and
          <string-name>
            <surname>J. van Zundert. “</surname>
          </string-name>
          <article-title>Vector space explorations of literary language”</article-title>
          .
          <source>In: Language Resources and Evaluation 53.4</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>625</fpage>
          -
          <lpage>650</lpage>
          . doi: 10.1007/s10579-018-09442-4.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Flesch</surname>
          </string-name>
          . “
          <article-title>A new readability yardstick”</article-title>
          .
          <source>In: The Journal of Applied Psychology</source>
          <volume>32</volume>
          (3
          <year>1948</year>
          ), pp.
          <fpage>221</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freise</surname>
          </string-name>
          . “
          <article-title>Textbezogene Modelle: Ästhetische Qualität als Maßstab der Kanonbildung</article-title>
          ”.
          <source>In: Handbuch Kanon und Wertung: Theorien, Instanzen, Geschichte</source>
          . Ed. by
          <string-name>
            <given-names>G.</given-names>
            <surname>Rippl</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Winko</surname>
          </string-name>
          . Stuttgart, Weimar: J.B. Metzler
          ,
          <year>2013</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. von</given-names>
            <surname>Heydebrand</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Winko</surname>
          </string-name>
          .
          <source>Einführung in die Wertung von Literatur</source>
          . Paderborn, München, Wien, Zürich: Schöningh,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jautze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van Cranenburgh</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Koolen</surname>
          </string-name>
          . “
          <article-title>Topic Modeling Literary Quality</article-title>
          ”.
          <source>In: Digital Humanities 2016: Conference Abstracts</source>
          . Ed. by
          <string-name>
            <given-names>M.</given-names>
            <surname>Eder</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Rybicki</surname>
          </string-name>
          . Jagiellonian University &amp; Pedagogical University, Kraków,
          <year>2016</year>
          . url: https://dh2016.adho.org/abstracts/95.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lagutina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lagutina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Boychuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vorontsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shliakhtina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Belyaeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Paramonov</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Demidov</surname>
          </string-name>
          . “
          <article-title>A Survey on Stylometric Text Features</article-title>
          ”.
          <source>In: 25th Conference of Open Innovations Association (FRUCT)</source>
          . Helsinki,
          <year>2019</year>
          , pp.
          <fpage>184</fpage>
          -
          <lpage>195</lpage>
          . doi: 10.23919/fruct48121.2019.8981504.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          . “
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          ”.
          <source>In: Proceedings of the 31st International Conference on Machine Learning</source>
          . Ed. by
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Jebara</surname>
          </string-name>
          . Beijing, China,
          <year>2014</year>
          , pp.
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          . url: proceedings.mlr.press/v32/le14.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moretti</surname>
          </string-name>
          .
          <source>Distant Reading</source>
          . Verso,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          . “
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          ”.
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing</source>
          . Association for Computational Linguistics. Hong Kong,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . doi: 10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rippl</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Winko</surname>
          </string-name>
          . “
          <article-title>Einleitung</article-title>
          ”.
          <source>In: Handbuch Kanon und Wertung: Theorien, Instanzen, Geschichte</source>
          . Ed. by
          <string-name>
            <given-names>G.</given-names>
            <surname>Rippl</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Winko</surname>
          </string-name>
          . Stuttgart, Weimar: J.B. Metzler,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rippl</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Winko</surname>
          </string-name>
          , eds.
          <source>Handbuch Kanon und Wertung: Theorien, Instanzen, Geschichte</source>
          . Stuttgart, Weimar: J.B. Metzler,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          . “
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          ”.
          <source>In: Journal of the American Society for Information Science and Technology 60.3</source>
          (
          <year>2009</year>
          ), pp.
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          . doi: 10.1002/asi.21001.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Underwood</surname>
          </string-name>
          .
          <source>Distant Horizons: Digital Evidence and Literary Change</source>
          . The University of Chicago Press,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Underwood</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sellers</surname>
          </string-name>
          . “
          <article-title>The Longue Durée of Literary Prestige</article-title>
          ”.
          <source>In: Modern Language Quarterly 77.3</source>
          (
          <year>2016</year>
          ). doi: 10.1215/00267929-3570634.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          . “
          <article-title>The harmonic mean p-value for combining dependent tests</article-title>
          ”.
          <source>In: Proceedings of the National Academy of Sciences of the United States of America 116.4</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>1195</fpage>
          -
          <lpage>1200</lpage>
          . doi: 10.1073/pnas.1814092116.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Winko</surname>
          </string-name>
          . “
          <article-title>Literatur-Kanon als invisible hand-Phänomen</article-title>
          ”.
          <source>In: Literarische Kanonbildung</source>
          . Ed. by
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Arnold</surname>
          </string-name>
          . München: edition text + kritik,
          <year>2002</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>