Predicting Canonization: Comparing Canonization
Scores Based on Text-Extrinsic and -Intrinsic Features
Judith Brottrager1 , Annina Stahl2 and Arda Arslan2
1 Technical University of Darmstadt, Institute of Linguistics and Literary Studies, Dolivostraße 15,
  64293 Darmstadt, Germany
2 ETH Zurich, Social Networks Lab, Weinbergstraße 109, 8092 Zurich, Switzerland


Abstract
The majority of literary texts ever written are hardly known, read, or studied today, and belong
to the so-called “Great Unread”. Theories of canonization predominantly focus on sociocultural
processes of selection which culminate in the formation of a canon, but say little about how the
texts themselves contribute to canonization. In this paper, we propose an operationalization of
canonization, which is then used to build a classifier that predicts a canonization score for a text by
considering text-intrinsic features only. Working on a historical corpus of English and German texts,
which includes both canonical and “unread” works, the results show that a canonization score based
on text-inherent features has weak correlations with a canonization score based on text-extrinsic
features.

Keywords
literary texts, canonization, text classification




1. Introduction
A major promise of computational literary studies has been the inclusion of the so-called “Great
Unread”, i.e. those texts that have been previously underrepresented in literary history and
which are hardly known, read, or studied today. Large-scale analyses have shown, however,
that the argumentative strength of “distant reading” approaches [14] does not lie in includ-
ing all of what Algee-Hewitt et al. [1] call the “archive”, i.e. all published texts preserved in
libraries and archives, but in contextualizing the available sample and the population in ques-
tion [19]. One way of contextualizing the relationship between the “Great Unread”—in other
words non-canonical texts—and highly canonical works is to explicitly address their degree of
canonization.
   The canon as such is not easily definable or even palpable: De- and recanonizations of
texts prove that their canonical status is not fixed, but changes over time. Additionally, the
criteria for evaluating literature vary between genres [16, 8] and over time [9] and can thus
not be universally operationalized. Recent contributions in the field of canon studies stress the
complex combination of selection and interpretation processes which are influenced by both
literary and non-literary factors [16]. Canonization is seen as the result of an interplay between
sociocultural, discursive, and institutional powers, which is only partially understood [22].

CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The
Netherlands
judith.brottrager@tu-darmstadt.de (J. Brottrager); annina.stahl@gess.ethz.ch (A. Stahl);
arda.arslan@gess.ethz.ch (A. Arslan)
ORCID: 0000-0002-3108-8936 (J. Brottrager); 0000-0001-5456-9815 (A. Stahl)
                               © 2021 Copyright for this paper by its authors.
                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




   These theories of canonization usually neglect the aspect of literary quality and there is
almost no research on the textual aspects of literary judgment [8]. Winko [22] points out
that although the question of whether canonical texts share textual features that differentiate
them from others is hardly ever addressed, literary scholars often implicitly name
certain qualities of texts as the reason for their status in the canon. Similarly, computational
literary approaches have often not discussed canonization in direct relationship with textual
features, but have either been limited to metadata [1] or have addressed related concepts of
distinction in the literary market, such as popularity (see Section 2).
   Our goal is to examine the relationship between a canonization score based on text-extrinsic
features, such as available editions and references in secondary literature, and a corres-
ponding score based on text-intrinsic features only. For this, each text in our corpus, which
comprises both canonized and non-canonized texts, is assigned a score reflecting the likelihood
of the text belonging to the canon; this score is also used to evaluate our models. If the classi-
fication based on text-inherent features is successful, then we can assume that the canonization
of texts is, as Winko [22] assumes, to a certain degree linked to textual qualities.


2. Previous Work
As mentioned above, existing quantitative studies investigate canonization either in metadata
analyses, by modeling indicators of prestige and popularity [20, 1], or by looking at related
concepts of distinction. For example, Cranenburgh, Dalen-Oskam and Zundert [6] use textual
features to predict the literariness of modern Dutch novels, which was previously determined
in a comprehensive survey of readers. They find that a model that combines document embed-
dings and topic modeling best predicts literariness. Their results further indicate that novels
that are perceived as especially literary tend to deviate from the norm by having higher se-
mantic complexity, and that these literary novels use certain words and topics more frequently.
Working with the same dataset, Jautze, Cranenburgh and Koolen [10] use topic modeling [5]
and show that novels that are perceived as highly literary tend to use a more diverse range of
topics. Ashok, Feng and Choi [3] predict literary success, measured by download counts from
Project Gutenberg, discovering that some style characteristics only explain literary success for
some genres, while others are universal indicators, and find some evidence that readability and
literary success are negatively correlated.


3. Corpus
Our corpus comprises 1,153 novels and narratives in English and German (606 and 547, re-
spectively), covering the Long 18th and the Long 19th century of British and German-language
literary history. This time span, from 1688 to 1914, avoids culture-specific temporal limits and en-
compasses great changes in literary production and consumption. We expect this comparative
bilingual approach to be especially productive because canonization processes differ signific-
antly between these two literary traditions [9]. During corpus preparation, we systematically
adapted an approach proposed by Algee-Hewitt and McGurl [2], which aims at achieving rep-
resentativeness for literary corpora by moving from a “found corpus” to a “made” list of texts,
working with best-of and bestseller lists and expert surveys. By doing so, Algee-Hewitt and
McGurl [2] combine three different tiers of the literary production: a more exclusive canon,
popular and financially successful texts, and a more diverse group of works added at the
suggestion of experts in Postcolonial and Feminist Studies. Analogously, we identified secondary
sources, narrative literary histories, anthologies, and more specialised academic monographs
that represent these tiers of literary production and used them as bibliographies for our corpus.1


4. Canonization Score
In order to be able to take canonization into account, we had to find a reliable operational-
ization. As neither the canon nor canonization processes are fixed or agreed upon, we have
decided to implement a canonization score which reflects the likelihood of a text belonging to
the canon. By defining the canonization score as a likelihood, we account for the flexibility of
canon formations.
   Based on theoretical background provided by Heydebrand and Winko [9], we formalized the
following characteristics of a canonized text: A text is more likely to be canonized if (1) there is
an edition of the complete or collected works of the author, (2) student editions of the text are
available, and if it is mentioned in (3) exclusive narrative literary histories and anthologies and
(4) other academic literature.2 For the computation itself, we made use of the bibliographical
information we had gathered during the corpus compilation (for example, the number of times
a specific work was mentioned in exclusive literary histories) and additional data taken from
the respective national bibliographies and selected publishing houses.
   Building on the conceptualization of the canonization score as a likelihood, we then identified
minimum and maximum values, i.e. those texts that are extremely unlikely or extremely likely
to be considered canonized. For the minima, we again used the bibliographic information
collected during the corpus compilation to identify those texts that were mentioned in only one
highly specialized secondary source. These specialized sources, which deal with literature by
marginalised authors and genres, reference texts that are likely to be known and read only by
an expert audience, which makes it extremely unlikely for them to be canonized. In contrast
to the mass of completely forgotten texts, however, they are at least in some form remembered
and available. To represent this difference in the score, we set the minimum score to 0.05.
The maxima were defined with the help of university reading lists: the more often a text was
mentioned on different reading lists, the higher its score. In our final model, works referenced
on more than 60% of the reading lists were assigned a score of 1.0, those mentioned on between
30% and 60% a score of 0.8, and all others with at least one mention a score of 0.6. Following this
1 As expected, not all texts listed were already digitized and we retro-digitized some of the missing texts
ourselves, focusing on adding more diversity to the corpus (by adding not yet represented authors, female
authors, authors from geographical peripheries, and niche genres).
2 Algee-Hewitt et al. [1] similarly formalize different canon notions by relying on entries in the Dictionary of
National Biography, the MLA Bibliography, and Stanford PhD exam lists.




determination of our training data, we trained a logistic regression model (which again reflects
the score’s conceptualization as a likelihood) on these texts and their respective canonization
features, i.e. the four characteristics mentioned above plus a count representing the number of
references to the text in highly specialized secondary sources.

Table 1
Change in the odds of a text being canonized per unit of each feature

   Feature                          Odds
   Complete/Collected works         2.58
   Student editions                 2.04
   Exclusive literary histories     1.98
   Academic literature              1.65
   Specialized secondary sources    0.52
   All features used have a significant impact on the discrimination between canonized and
non-canonized texts. Table 1 summarizes how a text’s odds of being canonized change per
unit. For binary characteristics, such as an existing student edition, this means that
the odds of being canonized are 2.04 to 1 if a text is available in such an edition. For count
features, such as the mentions in exclusive literary histories, each additional reference raises
the odds by a factor of 1.98.
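   To make the computation concrete, the following is a minimal sketch of such a model in
scikit-learn; it is not our actual pipeline, the feature values are invented, and the graded anchor
scores (0.05, 0.6, 0.8, 1.0) are binarized here because scikit-learn's logistic regression expects
hard labels (a binomial GLM that accepts fractional targets would mirror our setup more
closely). The odds in Table 1 correspond to the exponentiated coefficients.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature rows, one per extreme-case text. Columns:
# complete/collected works (0/1), student edition (0/1), and mentions
# in exclusive literary histories, in academic literature, and in
# highly specialized secondary sources (counts). Values are invented.
X = np.array([
    [1, 1, 4, 6, 0],
    [1, 1, 2, 3, 0],
    [1, 0, 3, 2, 0],
    [0, 1, 1, 2, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 2],
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # binarized extreme cases

clf = LogisticRegression().fit(X, y)
odds_per_unit = np.exp(clf.coef_[0])     # cf. the odds in Table 1
scores = clf.predict_proba(X)[:, 1]      # canonization scores in [0, 1]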




Figure 1: Distribution of canonization scores based on complete/collected works, student editions, and
mentions in exclusive literary histories


  The resulting model was then used for the prediction of the canonization scores for the
entire corpus. Figure 1 shows the scores for all texts and depicts how they are determined
by the existence of complete/collected works or student editions and the number of references
in exclusive literary histories; the points are transparent so that clusterings and overlaps are
identifiable. Overall, the upper end of the scale is dominated by texts by established and well-
researched authors (as they are published in complete/collected works), that are also likely to
be taught in schools and at universities (as they are published as student editions). The lower
range of the scale is dominated by texts which are not part of a narrowly defined literary history
(as they are not mentioned in exclusive literary histories) and whose authors are under-studied.


5. Methods
5.1. Approach
Having established a measure of the degree of canonization for each text in our corpus, we
used these scores as the ground truth for a model that predicts canonization scores solely from




textual features.
   We extracted a range of features from our texts that cover different aspects of style and
content. In a preparatory step, we converted all texts to lowercase and replaced German-
specific characters. In order to increase the number of data points on which we would train the
model, we split the documents into chunks of 200 sentences using spaCy, a library for natural
language processing, for sentence tokenization. Features were then calculated for either chunks
or full documents, depending on the nature of the feature. In section 5.2, feature extraction is
explained in more detail.
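   A minimal sketch of this preprocessing and chunking step is given below; the spaCy pipeline
names and the mapping of German-specific characters are illustrative assumptions rather than
a fixed specification.

import spacy

# Assumed pipelines: "en_core_web_sm" for English, "de_core_news_sm"
# for German; novels can exceed spaCy's default length limit.
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 3_000_000

# Hypothetical replacement table for German-specific characters.
GERMAN_CHARS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def chunk_text(text, chunk_size=200):
    """Lowercase, normalize, and split a text into chunks of 200 sentences."""
    doc = nlp(text.lower().translate(GERMAN_CHARS))
    sents = [sent.text for sent in doc.sents]
    return [" ".join(sents[i:i + chunk_size])
            for i in range(0, len(sents), chunk_size)]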
Using Support Vector Regression (SVR) as the regression model, we tested several combina-
tions of features and dimensionality reduction techniques for each language separately with a
10-fold cross-validation. All works of an author were part of the same fold in order to avoid
overfitting to an author’s characteristics. We selected the model with the highest Pearson
correlation coefficient (Pearson’s r) between the canonization scores and the predicted scores.
The p-value of the correlation coefficient was calculated by taking the harmonic mean of the
p-values of the folds [21]. We included chunk- and document-level features both separately and
in combination, either by adding the document-level features to each chunk, or by taking the
average of all chunks per document. For dimensionality reduction, we tried PCA, including
enough components so that 95% of the variance was explained, as well as SelectKBest from
scikit-learn with either mutual information regression or F-regression as the scoring function,
and retained 10% of features.
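   The following sketch illustrates this evaluation setup with scikit-learn and SciPy; the feature
matrix, scores, and author labels are randomly generated placeholders standing in for our corpus,
not our actual data.

import numpy as np
from scipy.stats import hmean, pearsonr
from sklearn.decomposition import PCA
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data: X holds extracted features, y the canonization
# scores, and groups the author of each row, so that all works of an
# author end up in the same fold.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 80))
y = rng.uniform(size=300)
groups = rng.integers(0, 60, size=300)

model = make_pipeline(StandardScaler(),
                      PCA(n_components=0.95),   # keep 95% of the variance
                      SVR())

rs, ps = [], []
for train, test in GroupKFold(n_splits=10).split(X, y, groups):
    pred = model.fit(X[train], y[train]).predict(X[test])
    r, p = pearsonr(y[test], pred)
    rs.append(r)
    ps.append(p)

# Mean correlation and the harmonic mean of the fold p-values [21].
print(np.mean(rs), hmean(ps))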
In addition to running the regression with all texts, we conducted two experiments with
the texts that served as training data for the canonization scores. In our first approach,
we trained the model on all texts and validated it only on these non-canonized and highly
canonized cases, and in the second approach, we both trained and validated the model on the
extreme cases.

5.2. Features
We used a wide range of established features from micro- to macro-textual levels, covering
character-based, lexical, and semantic characteristics.3 Starting on the level of individual
characters, the feature set comprises the ratios of various special characters, such as
punctuation marks. On the lexical level, we have included the tf-idf4 of a word if it occurs
in at least 10% of documents and is among the 30 words with the highest tf-idf for at least
one document, as well as n-gram-based features, such as the 100 most frequent uni-, bi-, and
trigrams, and the ratio of all unique uni-, bi-, and trigrams and their entropy. The type-
token ratio and the ratio of stopwords are proxies for a chunk’s lexical diversity, the Flesch
reading ease score [7] for its readability. Additionally, the average word and paragraph length,
the text length of a chunk of 200 sentences, and the average and maximum number of words
per sentence are used as features. We also created a doc2vec embedding [12] for each chunk,
treating chunks as separate documents.
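   Among these lexical features, the tf-idf selection rule is the most intricate one; the following
sketch shows one way to implement it with scikit-learn (the documents are placeholders, and
zero-weight ties in the top-30 ranking are ignored for brevity).

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative placeholder documents.
docs = ["the first placeholder novel text",
        "the second placeholder novel text",
        "a third rather different placeholder"]

vectorizer = TfidfVectorizer(min_df=0.1)      # document frequency >= 10%
tfidf = vectorizer.fit_transform(docs).toarray()
top30 = np.argsort(tfidf, axis=1)[:, -30:]    # top 30 words per document
keep = np.unique(top30)                       # union over all documents
tfidf_features = tfidf[:, keep]               # final lexical feature block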
   For the modeling of semantic complexity, we implemented four variations of distances
between chunks, which were introduced by Cranenburgh, Dalen-Oskam and Zundert [6]. The
document vectors obtained by embedding techniques are interpreted as points in a vector space
so that the Euclidean distance can be calculated between them. Representing in-text semantic

3 An overview of all features used can be found in the appendix.
4 Term frequency - inverse document frequency.




similarity,5 semantic coherence,6 semantic similarity to other texts,7 and semantic overlap with
other texts,8 these variations enable a diversified look at text similarities. We calculated all four
semantic complexity measures with both doc2vec and Sentence-BERT (SBERT) embeddings;
SBERT is a modification of BERT that better captures semantic similarity between sentences
[15].
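   Read together with footnotes 5 to 8 below, the four measures reduce to a few lines of numpy.
In this sketch each text is a (chunks x dimensions) array of chunk embeddings; the aggregation
details (sums rather than means) follow our reading of the definitions and are assumptions.

import numpy as np

def intra_textual_variance(chunks):
    # Sum of distances between the chunks and their centroid (footnote 5).
    centroid = chunks.mean(axis=0)
    return np.linalg.norm(chunks - centroid, axis=1).sum()

def stepwise_distance(chunks):
    # Distances between consecutive chunks (footnote 6).
    return np.linalg.norm(np.diff(chunks, axis=0), axis=1).sum()

def outlier_score(chunks, other_centroids):
    # Distance of the text's centroid to its nearest neighbour (footnote 7).
    centroid = chunks.mean(axis=0)
    return np.linalg.norm(other_centroids - centroid, axis=1).min()

def overlap_score(chunks, other_chunks):
    # Fraction of the k nearest neighbours of the centroid that are the
    # text's own chunks, with k = number of own chunks (footnote 8).
    centroid = chunks.mean(axis=0)
    pool = np.vstack([chunks, other_chunks])
    dist = np.linalg.norm(pool - centroid, axis=1)
    k = len(chunks)
    nearest = np.argsort(dist)[:k]
    return (nearest < k).mean()

# Placeholder usage with random chunk embeddings.
text = np.random.default_rng(0).normal(size=(12, 300))
rest = np.random.default_rng(1).normal(size=(500, 300))
print(intra_textual_variance(text), overlap_score(text, rest))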


6. Results and Discussion
For both languages, using PCA for dimensionality reduction yielded the highest correlation
coefficients in the cross-validation.9 For the English texts, the combination of document-level
features and averages across all chunks performed best with a Pearson’s r of 0.242, and chunk-level
features delivered the best results on the German texts with an r of 0.285. Table 2 shows the
results for using SVR, PCA, and different feature levels, and Figure 2 shows the canonization
scores versus the predicted scores. Limiting the training and/or test data to only non-canonized
and highly canonized texts produced correlation coefficients that were similar to those from
the full dataset. This can be seen as an indication that the canonization scores inferred for the
texts between the extreme cases are reliable.
  These weak correlations between the predicted and the actual canonization scores lead to the
conclusion that a model of canonization based on text-extrinsic features and a model based
on text-intrinsic features are only weakly interconnected. A closer look at some examples
shows, however, that some interesting systematic shifts between the models can be observed:
While for both the English and the German corpus, texts with the 10% highest canonization
scores were on average published during the first half and middle of the 19th century (1853
and 1834, respectively), texts with the 10% highest predicted scores center around 1873 in
the English and 1805 in the German corpus. This can be seen as an indication that what is
actually captured is the closeness to central literary periods: Texts written in the Victorian
Age dominate the highest predicted scores for the English corpus; texts from the Goethezeit
(1770-1830) those for the German corpus.


7. Conclusion and Future Work
Building upon the theoretical framework of canonization research and previous studies focusing
on indicators of literary distinction, our approach offers a quantitative operationalization of


5 Intra-textual variance is calculated by summing over the distances between the document chunks and the
centroid, which is obtained by averaging over the chunk vectors. A high intra-textual variance means that the
chunks are very semantically dissimilar to each other.
6 Stepwise distance is similar to intra-textual variance, but instead of calculating the distance of chunks from
the centroid, the distance between consecutive chunks is calculated, indicating how rapidly a text’s semantic
content changes.
7 Inter-textual variance is measured with the outlier score, which is the distance of a text, represented as the
centroid of its constituting chunks, to its nearest neighbour.
8 The overlap score measures which fraction of the k nearest neighbours of a text’s centroid are chunks that
belong to that very text, with k being the number of chunks in the text.
9 In order to allow for an evaluation of feature contribution, we also included SelectKBest from scikit-learn
in the cross-validation, which assigns a score to each feature using a scoring function, and then only keeps the
k features with the highest score. However, it performed worse than PCA in terms of correlation, so we chose
a model with PCA instead.




Table 2
Results (SVR, PCA, various feature combinations)
                                                                       Correlation
                                                                 English       German
            All documents
               Document                                          0.184***      0.218***
               Chunk                                             0.164***      0.285***
               Document + average of chunks                      0.242***      0.230***
               Document + all chunks                             0.239***      0.191***
            Full training and reduced test data
               Document                                          0.223**       0.258**
               Chunk                                             0.142**       0.331***
               Document + average of chunks                      0.267**       0.233**
               Document + all chunks                             0.155**       0.300**
            Reduced training and test data
               Document                                          0.289***      0.207**
               Chunk                                             0.197***      0.430***
               Document + average of chunks                      0.347**       0.269***
               Document + all chunks                             0.182         0.289***
            *** p < 0.01, ** p < 0.05, * p < 0.1


canonization and some initial analyses of the relationship between a metadata-based concep-
tualization of canonization and text-inherent features. Overall, our results indicate that this
relationship is very limited. There are, however, some trends on a smaller scale that call for
more detailed analyses.
   In the next stage of our project, we will focus on those texts whose text-extrinsic and -intrinsic
canonization scores differ widely. By doing so, we will be able to further investigate the patterns
of deviations. This step will also include an evaluation of the implemented features and an
analysis of their individual impact on the predictions.
   Moreover, dividing the texts into cohorts based on the publication date will help us explore
the difference between similarity to other canonized texts and canonization itself, as this would
level out the dominance of certain periods in the canon. These cohorts would also allow for a
more theory-based description of canonization processes, because, as Heydebrand and Winko
[9] have shown, value judgments and evaluative systems are highly adaptive and flexible.
   On a methodological level, our approach could be improved by adding features that require
more complex language processing, as Ashok, Feng and Choi [3] have done, for example, by
including the distribution of part-of-speech tags, syntactic production rules, or sentiments.
Finally, as we are working on historical texts from 1688 to 1914, the language models would have
to be trained on or adapted for historical language.


Acknowledgments
This work is part of ’Relating the Unread. Network Models in Literary History’, a project
supported by the German Research Foundation (DFG) through the priority programme SPP
2207 Computational Literary Studies (CLS). Special thanks to Ulrik Brandes and Thomas
Weitin for their feedback and support and to our anonymous reviewers for their invaluable




input and suggestions.


References
 [1]   M. Algee-Hewitt, S. Allison, M. Gemma, R. Heuser, F. Moretti and H. Walser. “Canon/Archive.
       Large-scale Dynamics in the Literary Field”. In: Pamphlets of the Stanford Literary Lab
       11 (2016). url: https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf.
 [2]   M. Algee-Hewitt and M. McGurl. “Between Canon and Corpus: Six Perspectives on 20th-
       Century Novels”. In: Pamphlets of the Stanford Literary Lab 8 (2015).
       url: http://litlab.stanford.edu/LiteraryLabPamphlet8.pdf.
 [3]   V. Ashok, S. Feng and Y. Choi. “Success with style: Using writing style to predict the
       success of novels”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural
       Language Processing. Seattle, Washington: Association for Computational Linguistics,
       2013, pp. 1753–1764. url: https://aclanthology.org/D13-1181.pdf.
 [4]   C. Bentz, D. Alikaniotis, M. Cysouw and R. Ferrer-i-Cancho. “The Entropy of Words–
       Learnability and Expressivity across More than 1000 Languages”. In: Entropy 19.6 (2017),
       p. 275. doi: 10.3390/e19060275.
 [5]   D. M. Blei, A. Y. Ng and M. I. Jordan. “Latent Dirichlet Allocation”. In: Journal of
       Machine Learning Research 3 (2003), pp. 993–1022. url: https://www.jmlr.org/papers/
       volume3/blei03a/blei03a.pdf.
 [6]   A. van Cranenburgh, K. van Dalen-Oskam and J. van Zundert. “Vector space explorations
       of literary language”. In: Language Resources and Evaluation 53.4 (2019), pp. 625–650.
       doi: 10.1007/s10579-018-09442-4.
 [7]   R. Flesch. “A new readability yardstick”. In: The Journal of Applied Psychology 32 (3
       1948), pp. 221–33.
 [8]   M. Freise. “Textbezogene Modelle: Ästhetische Qualität als Maßstab der Kanonbildung”.
       In: Handbuch Kanon und Wertung: Theorien, Instanzen, Geschichte. Ed. by Rippl, Gab-
       riele and Winko, Simone. Stuttgart, Weimar: J.B.Metzler, 2013, pp. 50–58.
 [9]   R. von Heydebrand and S. Winko. Einführung in die Wertung von Literatur. Paderborn,
       München, Wien, Zürich: Schöningh, 1996.
[10]   K. Jautze, A. van Cranenburgh and C. Koolen. “Topic Modeling Literary Quality”. In:
       Digital Humanities 2016: Conference abstracts. Ed. by M. Eder and J. Rybicki. Jagiel-
       lonian University & Pedagogical University, Kraków, 2016. url: https://dh2016.adho.
       org/abstracts/95.
[11]   K. Lagutina, N. Lagutina, E. Boychuk, I. Vorontsova, E. Shliakhtina, O. Belyaeva and
       I. Paramonov. “A Survey on Stylometric Text Features”. In: 25th Conference of Open
       Innovations Association (FRUCT). Helsinki, 2019, pp. 184–195.
       doi: 10.23919/fruct48121.2019.8981504.
[12]   Q. Le and T. Mikolov. “Distributed Representations of Sentences and Documents”. In:
       Proceedings of the 31st International Conference on Machine Learning. Ed. by E. P.
       Xing and T. Jebara. Beijing, China, 2014, pp. 1188–1196.
       url: https://proceedings.mlr.press/v32/le14.pdf.




[13]   M. M. Mirończuk and J. Protasiewicz. “A recent overview of the state-of-the-art elements
       of text classification”. In: Expert Systems with Applications 106 (2018), pp. 36–54. doi:
       10.1016/j.eswa.2018.03.058.
[14]   F. Moretti. Distant Reading. Verso, 2013.
[15]   N. Reimers and I. Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese
       BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical Methods in Nat-
       ural Language Processing and the 9th International Joint Conference on Natural Lan-
       guage Processing. Hong Kong: Association for Computational Linguistics, 2019,
       pp. 3982–3992. doi: 10.18653/v1/D19-1410.
[16]   G. Rippl and S. Winko. “Einleitung”. In: Handbuch Kanon und Wertung: Theorien,
       Instanzen, Geschichte. Ed. by Rippl, Gabriele and Winko, Simone. Stuttgart, Weimar:
       J.B.Metzler, 2013.
[17]   Rippl, Gabriele and Winko, Simone, eds. Handbuch Kanon und Wertung: Theorien, In-
       stanzen, Geschichte. Stuttgart, Weimar: J.B.Metzler, 2013.
[18]   E. Stamatatos. “A Survey of Modern Authorship Attribution Methods”. In: Journal of
       the American Society for Information Science and Technology 60.3 (2009), pp. 538–556.
       doi: 10.1002/asi.21001.
[19]   T. Underwood. Distant Horizons – Digital Evidence and Literary Change. The University
       of Chicago Press, 2019.
[20]   T. Underwood and J. Sellers. “The Longue Durée of Literary Prestige”. In: Modern
       Language Quarterly 77.3 (2016). doi: 10.1215/00267929-3570634.
[21]   D. J. Wilson. “The harmonic mean p-value for combining dependent tests”. In: Proceed-
       ings of the National Academy of Sciences of the United States of America 116.4 (2019),
       pp. 1195–1200. doi: 10.1073/pnas.1814092116.
[22]   S. Winko. “Literatur-Kanon als invisible hand-Phänomen”. In: Literarische Kanonbildung.
       Ed. by H. L. Arnold. München: edition text + kritik, 2002, pp. 9–24.




A. Overview of Features


Table 3
Text Features for Prediction
               Type            Feature                                Source   Text Level
               Character       Character frequency                    [11]     Chunk
                                 Ratio of punctuation marks
                                 Ratio of whitespace
                                 Ratio of digits
                                 Ratio of exclamation marks
                                 Ratio of question marks
                                 Ratio of commas
                                 Ratio of uppercase letters
               Lexical         n-grams                                [11]
                                   Unigrams (100 most frequent)                Document
                                   Bigrams (100 most frequent)                 Document
                                   Trigrams (100 most frequent)                Document
                                   Ratio of unique word unigrams               Chunk
                                   Ratio of unique word bigrams                Chunk
                                   Ratio of unique word trigrams               Chunk
                                   Unigram entropy                    [4]      Chunk
                                   Bigram entropy                     [4]      Chunk
                                   Trigram entropy                    [4]      Chunk
                               Ratio of stopwords                              Chunk
                               tf-idf                                 [13]     Document
                               Type-token ratio                       [1]      Chunk
                               Average number of words per sentence   [11]     Chunk
                               Max. number of words per sentence               Chunk
                               Average word length                    [18]     Chunk
                               Average paragraph length               [18]     Chunk
                               Text length per 200 sentences                   Chunk
                               Flesch reading ease score              [7]      Chunk
               Semantic        Intra-textual variance                 [6]      Document
                               Stepwise distance                      [6]      Document
                               Outlier score                          [6]      Document
                               Overlap score                          [6]      Document
               Embedding       Doc2Vec                                [12]     Chunk




B. Predicted Scores

Figure 2: Canonization scores vs. predicted scores (SVR, PCA, best feature level for each language).
Panels: (a) English, all documents, book + average chunk features; (b) German, all documents,
chunk features; (c) English, full training and reduced test data, book + average chunk features;
(d) German, full training and reduced test data, chunk features; (e) English, reduced training
and test data, book + average chunk features; (f) German, reduced training and test data,
chunk features.



