     Research of Morphemic Vocabulary Volume
    Based on the Material of German Short Stories∗
                   Bence Nyéki1                             Vladimı́r Benko2
              nyeki.bence96@gmail.com                   vladimir.benko@juls.savba.sk
                              1
                           Saint-Petersburg State University,
                         Saint-Petersburg, Russian Federation
             2
               Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics,
                                 Bratislava, Slovakia



                                                Abstract
            The dependence of the vocabulary volume on text sample size has been studied on
        the material of literary texts [Grebennikov and Assel, 2019] as well as everyday spoken
        language [Kosareva and Martynenko, 2015]. The present research concerns the study
        of the morphemic type-token ratio in samples from Franz Kafka’s and Thomas Mann’s
        short stories in German. The morphemic annotation of the samples from these texts was
        carried out manually and was aimed at finding the asymptote of the function “morpheme
        token–morpheme type.” This helps to conclude whether the expansion of the list of
        lexemes in these authors’ texts is due to the occurrence of new stems or to word formation.
            Keywords: type-token ratio, word formation, morpheme, approximation




1       Introduction
     Our work is based on the idea that the analysis of the morphemic structure of
words deserves serious attention, as compounding and affixation (i.e. the concatenation of
morphemes) serve as productive tools for the creation of morphologically complex words
with semantically transparent structure in many languages. Consequently, the total number
of morphemes in such a language is smaller than that of lexemes. As a result, it might be
easier to obtain a representative sample of morphemes from a corpus than in the case of words.
Of course, it should be noted that not every lexical unit is semantically compositional at the
morphemic level; this feature corresponds to the problem of syntagmatic idiomaticity. Despite
this fact, knowledge about the units of the morphological system of languages with rich word
formation, such as German and Russian, can still be considered useful. Another important
    ∗
    Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


aspect is that word formation is considered to play a significant role in text building. For
example, Zemskaya [1992: 164] identifies six ways in which word formation manifests itself as an activity
in speech acts:

        • derivation from a pretext word or syntagma during speech act production

        • use of a set of derivatives of the same type within a text

        • formation of different derivatives from the same base

        • use of words with identical derivational meaning

        • juxtaposition of derivatives from homonymous words

        • contrastive use of words with the same root.

     These observations suggest that the study of morphemes in text can be a source of
information for the description of an individual style.
     This research is aimed at modelling the growth of the morphemic vocabulary (in number
of morphemes) as a function of sample size on the material of German literary texts. In the
next section, research on the type-token ratio in Russian texts is briefly reviewed; in the
present work, an attempt was made to adapt this basic idea to the quantitative study of texts
at the morphemic level. In the third section, the data used for the present research and their
annotation are discussed. Results of fitting a distribution to the data obtained are presented in
the fourth section. The last section is a summary of our conclusions and further plans.


2         Related work
     In [Kosareva and Martynenko, 2015] research was carried out on the ORD corpus1 (“One
Day of Speech” corpus) to estimate the asymptote of the function modelling the type-token
dependence in spoken Russian texts. The Weibull and Haustein functions were used for
approximation; the latter was found to fit the data better, and an asymptotic level
of about 45,000 lexemes was estimated. With the same methods, approximations by the two
functions were compared on the material of short stories by Russian writers in [Grebennikov
and Assel, 2019], but the authors concluded that, depending on how the growth stabilizes,
the Haustein function is not always preferable. It is also worth noting that a significant difference was
found between the authors. The growth of the number of types in Chekhov’s short stories
slows down considerably at a sample size of around 150,000 tokens (ca 16,000 lexemes). In
the case of Averchenko, however, no clear upper limit of the growth could be set.
     The annotation of the data used in the present work is based on linguistic
principles: [Mel’čuk, 2006: 390] should be mentioned as the theoretical framework and [Fleischer
and Barz, 2012] as a description of contemporary German word formation. More details will
be given in the next section.
    1
        http://www.ord-corpus.spbu.ru/SocialStudies/ORD.html


3       The data
3.1     Preparing the data
     In order to perform a quantitative experiment at the subword level, it is essential to have
a large amount of morphemically annotated text. Algorithms for the automatic segmentation
of words into smaller meaningful units are available. For example, Morfessor FlatCat
is based on a hidden Markov model with the hidden states “stem”, “prefix”, “suffix” and “non-
morpheme”, which enables unsupervised machine learning [Grönroos et al., 2014]. However,
such an algorithm could not be applied for data annotation in our work, for several reasons.
     Firstly, the segmentations should be maximally precise and based on linguistic principles.
Secondly, even the correct identification of morphs, i.e. the minimal meaningful substrings of
words, is unsuitable for the planned experiment. Morphs are tokens of morphemes, each
of which should be represented by a single form. For instance, the morpheme KEIT ‘-ness’
appears in the forms -keit and -heit in the German words Wichtigkeit (importance) and Dunkelheit
(darkness), respectively. Thirdly, the elaboration of a fine-grained system of supplementary
morpheme tags is necessary in order to disambiguate homonymous morphemes. Due to these
conditions, the data were annotated manually.
     For the given experiment, the texts of two German authors were chosen. The short stories
of Thomas Mann in a collection available in Project Gutenberg2 as well as all literary works
(but not diary entries and private letters) of Franz Kafka in Project Gutenberg-DE3 were
copied and saved as plain text. The short stories alone might not have contained enough tokens,
so Kafka’s novels were also included in the material. Thus the document containing Kafka’s
texts (ca 290,000 words) is much larger than the collection of Mann’s short stories (ca 39,000
words), but the latter is still big enough for sampling.
     The texts were first annotated by TreeTagger4, which is a part-of-speech (POS) tagger
that applies a Markov model and a decision tree [see Schmid, 1994, 1995]. Apart from a
relatively large number of POS-tags5, lemmatization is also provided by the software. This
information simplifies the process of morphemic annotation, as a tuple consisting of a POS-tag
and a lemma usually determines the correct morphemic analysis unambiguously.
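
     As an illustration of this preprocessing step, a short Python sketch is given below. It is our own illustration rather than the script actually used, and the file name is a hypothetical placeholder; it reads TreeTagger’s standard tab-separated output (one word–POS-tag–lemma triple per line) and collects the distinct POS-tag–lemma pairs that have to be analyzed morphemically.

    # Illustrative sketch (not the original script); the file name is a placeholder.
    def read_tagged(path):
        """Yield (word, pos, lemma) tuples from TreeTagger's tab-separated output."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 3:               # skip empty or malformed lines
                    yield tuple(parts)

    # Each distinct POS-tag/lemma pair needs to be analyzed only once:
    # tokens = list(read_tagged("mann_tagged.txt"))          # hypothetical file name
    # keys_to_analyze = {(pos, lemma) for _, pos, lemma in tokens}
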
     Sampling was implemented as follows. From each of the two collections, 60 non-
overlapping text fragments of 250 tokens each were randomly selected.
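
     The sampling step can be sketched as follows; this is one plausible reading of the procedure (the token stream is cut into disjoint 250-token windows, the windows are shuffled, and 60 of them are kept), not necessarily the original implementation.

    import random

    FRAGMENT_LEN = 250
    N_FRAGMENTS = 60

    def sample_fragments(tokens, n=N_FRAGMENTS, size=FRAGMENT_LEN, seed=1):
        """Return n non-overlapping fragments of `size` consecutive tokens."""
        starts = list(range(0, len(tokens) - size + 1, size))   # disjoint windows
        random.Random(seed).shuffle(starts)
        return [tokens[s:s + size] for s in sorted(starts[:n])]

    # fragments = sample_fragments(tokens)   # 60 samples of 250 tokens each
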

3.2     Morphemic tags
     The words in the obtained samples were analyzed manually; punctuation marks were
ignored. As mentioned above, contextual information was not usually needed to find the
right annotation, so each POS-tag–lemma tuple was processed only once. In the case when
disambiguation was necessary, the given words were analyzed in context.
     The annotation assigns to each identified word segment a string that represents a morpheme
and a tag analogous to a POS-tag. The latter (for simplicity, it may be referred to as a
morphemic POS-tag or MPOS-tag) is a two-level tag whose first
component takes a value from the set “ST”, “PF”, “SF”; the elements of this set stand
    2
     https://www.gutenberg.org/files/36766/36766-0.txt
    3
     https://gutenberg.spiegel.de/autor/franz-kafka-309
    4
      https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
    5
      Documentation for German available at: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/stts_guide.pdf


for “stem”, “prefix” and “suffix”, respectively (so they correspond to three of the hidden states
of the model applied by Morfessor FlatCat). The second component is determined by the part of
speech of the whole word form when the given morpheme is added to the stem. It takes a value
from the set “NN”, “VV”, “ADJ”, “ADV”, “PP”, “DET”, “PPER”, “PREL”, “PD”, “PUF”, “PAV”,
“PWAV”, “KOUS”, “KOUI”, “KON”, “KOKOM”, “APPR”, “PTKVZ”, “PTKZU”, “PTKNEG”,
“PTKA”, “PTKANT”, “ITJ”. These values are connected to the POS-tags assigned to the words
of the samples by TreeTagger. However, it should be noted that this set is smaller than the set
of all TreeTagger POS-tags, so each of the morphemic annotation units may be associated
with several original TreeTagger tags. For example, attributive and substitutive/predicative
roles are not reflected at the morphemic level. A one-morpheme adjective gets the same
morphemic tag (ST-ADJ) in both attributive (ADJA) and predicative (ADJD) positions. Of
course, an attributively used adjectival word form normally consists of at least two morphemes,
the last of which is an inflectional ending. Inflectional morphemes, however, were simply
ignored, as their main function is marking syntactic relations within a sentence and they can
hardly be considered informative from the perspective of textuality or individual style. Furthermore,
considering inflectional morphemes in the segmentation would make it impossible to determine
the correct morphemic analysis without information about the concrete word form to be
annotated. So-called “Fugenelemente” (meaningless segments of words on the borders of
constituent morphemes) were ignored as well. Examples are given in Tables 1–2.
      It is worth noting that word-level POS-tags are also used in modules of word sense disam-
biguation systems [Wilks and Stevenson, 1997]. This means that the method of identifying
morphemes by a normal form and an MPOS-tag makes use of an analogy between the morpheme
and word levels.
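
     To make this representation concrete, the following sketch (our own illustration; the class and function names are not taken from the paper) shows how a morpheme type can be identified by a normal form together with the two components of an MPOS-tag, and how the growth of the morphemic vocabulary can be tracked. The analysis of Wirkungen follows Table 1.

    from typing import NamedTuple

    class Morpheme(NamedTuple):
        form: str   # normal form of the morpheme, e.g. "wirk"
        kind: str   # "ST", "PF" or "SF"
        pos: str    # word class contributed when the morpheme joins the stem, e.g. "NN"

    # Wirkungen ('effects') = ST-VV_wirk + SF-NN_ung (cf. Table 1)
    WIRKUNGEN = [Morpheme("wirk", "ST", "VV"), Morpheme("ung", "SF", "NN")]

    def vocabulary_growth(morpheme_tokens):
        """Return (token_count, type_count) pairs as the sample grows."""
        seen, growth = set(), []
        for i, m in enumerate(morpheme_tokens, start=1):
            seen.add(m)
            growth.append((i, len(seen)))
        return growth
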

3.3    Annotation principles
     Although it might seem trivial to define a morpheme as the smallest segmental meaningful
part of a word, in practice it can be difficult to find a theoretically supportable morphemic
analysis of a particular word. For example, it needs to be decided whether the following words
are to be split into two morphemes or not:

  a) Mädchen ‘girl’

  b) bekommen ‘get, receive’

  c) entdecken ‘discover’

  d) Augenglas ‘glasses’

     All examples above were marked as single morphemes in the samples. Word a) consists
of the diminutive suffix -chen and a pseudostem which cannot be considered a sign of
standard German. Following the concepts in [Mel’čuk, 2006: 384], any linguistic sign should
be representable as a triplet of a signified, a signifier and syntactics, a condition that is not met by
the given pseudostem due to the lack of a signified at the synchronic level. In b) both be- and
komm(en) are valid linguistic signs. However, the first is an abstract verbal prefix while the
second is a verb meaning ‘come’. Obviously, the meaning of the derivative cannot be composed
of these semantic elements. The possible constituents of c) are ent-, which is a verbal prefix

                   Table 1: Examples of morphological analysis. Part 1

 Meaning | Word form | POS | Lemma | Analysis | Comment
 ‘war years’ | Kriegsjahren | NN | Kriegsjahr | ST-NN_krieg + ST-NN_jahr | Fugenelement s is ignored
 ‘effects’ | Wirkungen | NN | Wirkung | ST-VV_wirk + SF-NN_ung |
 ‘hands’ | Hände | NN | Hand | ST-NN_hand | The same annotation is assigned to every allomorph of the same morpheme
 ‘bitter, resentful’ | bitterlich | ADJD | bitterlich | ST-ADJ_bitter + SF-ADJ_lich |
 present participle from ‘hold back’ | zurückhaltenden | ADJA | zurückhaltend | ST-PTKVZ_zurück + ST-VV_halt + SF-ADJ_end | Participle morphemes are not considered inflection; also, they are annotated
 ‘dilute’ | verdünntem | ADJA | verdünnt | PF-VV_ver + ST-ADJ_dünn + SF-PP_t |
 ‘(have) seen’ | gesehen | VVPP | sehen | ST-VV_seh + SF-PP_t | The circumfix ge. . . t (ge. . . en is regarded as an allomorph) is marked as a single suffix which is identical to the participle morpheme -t/-en
 ‘stood’ | stand | VVFIN | stehen | ST-VV_steh | The vowel change within the root implies inflectional meaning, so it is ignored




                  Table 2: Examples of morphological analysis. Part 2

 Meaning | Word form | POS | Lemma | Analysis | Comment
 ‘is’ | ist | VAFIN | sein | ST-VV_sei | In case of suppletion, the lemma is analyzed
 ‘one, anyone’ | man | PIS | man | ST-PUF_man | ST-PUF is a special tag without an equivalent in the TreeTagger tagset; it is assigned to a small set of uninflectable words
 ‘this’ | dieser | PDAT | dies | ST-DET_dies | ST-DET (determiner) is a frequent tag that is assigned to articles and demonstrative pronouns without homonyms
 ‘(for) the’ | dem | ART | die | ST-DET_die | The next examples demonstrate the disambiguating effect of MPOS-tags (in this case they disambiguate whole words)
 ‘which’ (relative pronoun) | die | PRELS | die | ST-PRELS_die |
 ‘that’ (demonstrative pronoun) | das | PDS | die | ST-PD_die |




that implies the cancellation of an action or a state as one of its senses, and deck(en) (cover,
protect, hide). Although not hiding and discovering something can be regarded as cognitively
associated senses, such a weak relation did not seem to be sufficient to split the word into two
parts. The compound Auge (eye) + Glas (glass) is presented in d). In fact, it is close to being
semantically compositional, but the meaning of Glas is still too general.
      In linguistics, degrees of semantic compositionality, which can also be referred to as
transparency, are sometimes distinguished. In [Ransmayr et al., 2016: 267–268] eleven degrees
of transparency of German derived words with the diminutive suffix -chen were defined. The
opaquest derivatives are words without a synchronically identifiable stem, like a) above, or
with a non-noun stem like Frühchen (premature infant). The meanings of the most transparent
derivatives are simply constructed of those of their constituent morphemes. Sometimes they
can be affected by pragmatic restrictions, for example, diminutives denoting clothes such as
Jäckchen (small jacket) are usually used to refer to women’s or children’s clothes. Between
these extremes there are instances of weakly (e.g. metaphorically) motivated compositions
such as Eichhörnchen (squirrel) and Hörnchen (croissant).
      All derived words which are not maximally transparent can be considered, using the
terminology of [Mel’čuk, 2006: 390], quasimorphs, which should be stored in the dictionary as
separate entries. However, a typical lexicographic problem often makes this principle
difficult to apply in practice: it is not always clear how to define the meaning of the constituent
morphemes. Taking once again the stem of c) as an example, it is not self-evident without thorough
corpus research whether the sense ‘protect, hide’ should be considered a simple metaphor
or a word sense which needs lexicographic description. This is a usual lexicographic problem,
which inevitably occurs and must be solved more or less subjectively in each case. Despite
this, some principles of segmentation formulated in advance can serve as considerable
theoretical support. In view of the aspects discussed above, they can be summarized as follows:

  1. All constituents of the analyzed word should be meaningful linguistic units at a synchronic
     level.

  2. A bound morpheme is to be added to the morphemic vocabulary only if it occurs in
     several non-synonymous words in which it has the same sense (which may nevertheless be
     highly general or opaque).

  3. A complex unit (quasimorph) is preferable only if its meaning is not fully transparent.
     Under the assumption that word meaning can be represented as a finite set of senses,
     which are considered to be conventionalized, this means that we expect all senses
     of a morphologically complex word w to be representable as elements of the Cartesian
     product of the constituents’ senses. If some senses of w are elements of the Cartesian
     product while others are not, the corresponding quasimorph should be added to the
     vocabulary when w occurs with a non-compositional sense.

     These principles are simple, but they can help to make consistent decisions. For example, it
follows from 2) that such morphemes as the pseudostem of Mädchen (girl) cannot be treated
as separate meaningful segments of words even if several synonymous lexemes with the same
pseudostem exist in the language, e.g. Mädchen and Mädel. This idea can be generalized
to any unique morpheme which occurs only as a constituent of a certain lexeme. Note that
a different viewpoint, based on structural rather than strictly semantic analysis, is also
presented in [Fleischer and Barz, 2012: 65].

       The analysis of the noun Aufzug (the act of lifting, hoisting / elevator / act in theater)
shows the consequences of principle 3). If it occurs as a noun derived directly from the verb
aufziehen (auf ‘up’ + ziehen ‘pull’), then it is correct to segment it into two constituents.
However, in a sample from Thomas Mann’s texts this is not the case: it occurs in the sense
‘act in theater’ and is therefore added to the morphemic vocabulary as a whole unit.
       It is obvious that the last principle makes knowledge about word senses essential
for morphemic analysis, which in turn makes it necessary to rely on a lexicographic resource. DWDS6
was chosen as such a resource, as it ensures quick access not only to lexicographic but also to
corpus data if needed. For details about DWDS see [Geyken, 2007].
       However, these ideas were not extended to verbs with separable prefixes (and to words
compositionally derived from them), although they often form lexical units with a semantically
opaque structure. Separable prefixes can take a position very far from the verbal stem in the
sentence, which makes it hard to suggest an appropriate annotation method. This remains
a problem to be resolved.
      Since morphemes are not the only elementary linguistic signs [Mel’čuk, 2006: 295–297],
it is necessary to mention how non-segmental signs were handled. In German such signs are
frequently applied by means of conversion and modification. Take the nouns Schritt (step),
Eintreten (the act of coming in) and the verb beenden (finish) as examples. Schritt is derived
from schreiten (to step) by vowel change, and Eintreten is obtained by applying conversion to
the verb eintreten (come in). In beenden the noun Ende (i.e. its allomorph end) can be
observed; since be- is a verbal prefix, it must be concluded that the noun has been converted into a
verb (there is also another verb, enden, with nearly the same meaning). As our annotation
is morphemic, these non-segmental signs are simply ignored. This means that the analyses
of these words are ST-VV_schreit, ST-PTKVZ_ein + ST-VV_tret and PF-VV_be + ST-
NN_ende, respectively. Of course, this is a simplification: these signs are frequent in German
and the morphological process of conversion is highly productive.
       Now that all major problems of the data annotation have been discussed, the results can be
presented.


4       Results
     Once the text fragments had been annotated, it was necessary to find a distribution that can
model the growth of the morphemic vocabulary as a function of sample size. As mentioned above,
the Weibull and Haustein functions have been applied to similar goals [Kosareva and Martynenko,
2015; Grebennikov and Assel, 2019]. Our paper is not aimed at comparing how well different
functions fit the data; only the cumulative Weibull distribution was chosen for modelling, which
is usually defined as follows:

\[
y = 1 - e^{-(x/\eta)^{\beta}}
\]
    In related work, a slightly different equation is given as the Weibull function [Grebennikov
and Assel, 2019: 380]:
                                                            d
                                     y = Nmax − Nmax e−cx
        Apart from the exponent, the difference is that the right side of the last equation is
    6
    Digitales Wörterbuch der deutschen Sprache (Digital Dictionary of the German Language):
https://www.dwds.de/


multiplied by Nmax, which is the asymptote of the function, i.e. the theoretical maximal
volume of the morphemic vocabulary.
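
     For clarity, the linearization used in the fitting procedure described below can be written out explicitly; this is a reconstruction of the standard Weibull linearization and matches the column headers of Tables 5–6:

\[
\frac{y}{N_{max}} = 1 - e^{-(x/\eta)^{\beta}}
\quad\Longrightarrow\quad
\ln\left(-\ln\left(1 - \frac{y}{N_{max}}\right)\right) = \beta \ln x - \beta \ln \eta ,
\]

so that β is the slope and −β ln η is the intercept of a linear regression of ln(−ln(1 − y/Nmax)) on ln x.
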
     To simplify computation, the data were manipulated to enable the use of the former (more
standard) formula. Firstly, the empirical values (the registered number of morpheme types at a given
sample size) were divided by a hypothetical value of the maximal volume. Then the equation was
linearized (analogously to [Kosareva and Martynenko, 2015]) to estimate the parameters of the
Weibull distribution from the slope and intercept of a linear regression. Theoretical
values of y could then be calculated and multiplied by Nmax. The most appropriate value
of Nmax was estimated by the least-squares method: the initial candidate was set to the highest observed
vocabulary volume plus 50 (it was clear from the data that the asymptote was significantly
higher than any observed value), and 50 was added to Nmax in each following step
until the smallest sum of quadratic deviations between the theoretical and observed values of y
was reached (some steps could be omitted by immediately adding more than 50 to the upper
limit of the vocabulary volume). Of course, the obtained values are approximate and could be
determined more precisely; still, the results clearly show the difference between the texts of the
two German authors.
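
     The procedure can be illustrated with the following Python sketch. The function names are ours, the step of 50 is applied uniformly, and the search stops at the first increase of the sum of squared deviations (which, according to Tables 3–4, behaves unimodally), so this is an approximation of the described method rather than the original implementation.

    import math

    def fit_weibull(data, n_max):
        """Linearized least-squares estimate of beta and eta for a fixed N_max.

        `data` is a list of (sample_size, observed_number_of_types) pairs.
        """
        xs = [math.log(x) for x, y in data]
        zs = [math.log(-math.log(1.0 - y / n_max)) for x, y in data]
        n = len(data)
        mean_x, mean_z = sum(xs) / n, sum(zs) / n
        beta = (sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
                / sum((x - mean_x) ** 2 for x in xs))        # regression slope
        eta = math.exp(mean_x - mean_z / beta)               # from the intercept
        sse = sum((y - n_max * (1 - math.exp(-(x / eta) ** beta))) ** 2
                  for x, y in data)
        return beta, eta, sse

    def search_n_max(data, step=50):
        """Raise N_max in steps of 50, starting from the highest observed
        volume + 50, until the sum of squared deviations stops decreasing."""
        n_max = max(y for _, y in data) + step
        best = None
        while True:
            beta, eta, sse = fit_weibull(data, n_max)
            if best is not None and sse >= best[3]:
                return best              # previous candidate minimized the SSE
            best = (n_max, beta, eta, sse)
            n_max += step
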
     Tables 3–4 show some hypothetical values of Nmax and the corresponding sums of quadratic
deviations. The data necessary for calculating the theoretical values of y, given the value of Nmax
that eventually proved best, are presented in Tables 5–6. These tables serve as illustrations
and contain only every third observed value of the morphemic type-token function (of
course, the whole sets of observations were used to find the distribution parameters). The curves
of the cumulative Weibull distributions determined by the calculated parameters and Nmax are
depicted in Figure 1.



Table 3: Hypothetical values of Nmax and the corresponding sums of quadratic deviations for
                             samples from Franz Kafka’s texts

                                    Nmax      Sum of quadratic deviations
                                    1793      148624
                                    2293     22794,88
                                    2493     16697,63
                                    2543     16109,97
                                    2593     15775,62
                                    2643     15652,69
                                    2693     15706,58




     Figure 1 shows that the growth of the functions is nearly identical until vocabulary size
reaches approximately 1000 morphemes. After that the growth of Kafka’s vocabulary slows
down and stabilizes at a sample size of about 60 thousand morpheme tokens. The other
function has a considerably higher asymptote and stabilizes at about 80 thousand morpheme
tokens.

Table 4: Hypothetical values of Nmax and the corresponding sums of quadratic deviations for
                            samples from Thomas Mann’s texts

                          Nmax    Sum of quadratic deviations
                          2154             264363,87
                          2654             64892,02
                          3154             30309,74
                          3354              26620,6
                          3554              25473,1
                          3604             25460,79
                          3654             25534,69




   Table 5: Calculations for finding distribution parameters and theoretical values for the
   samples from Franz Kafka’s texts. The highest observed volume of vocabulary is 1743
         morphemes. Nmax = 2643, β = 0.69, η = 13573.52, Σ(yi − yj)² = 15652.69.

 Sample size x   Observed value yi   ln(x)   yi/Nmax   ln(−ln(1 − yi/Nmax))   Weibull value   Theoretical value yj   Quadratic deviation
  726        315       6,59      0,12          -2,06           0,12    327,12    146,98
 1453        496       7,28      0,19          -1,57           0,19    507,97    143,23
 2213        675       7,7       0,26          -1,22           0,25    656,35    347,91
 2936        803       7,98      0,3           -1,02           0,29    775,03    782,11
 3669        908       8,21      0,34          -0,87           0,33    879,94    787,22
 4389        990       8,39      0,37          -0,76           0,37    971,58    339,14
 5109        1062      8,54      0,4           -0,67           0,4     1054,26    59,95
 5796        1129      8,66      0,43          -0,58           0,43    1126,25    7,54
 6516        1196      8,78      0,45          -0,51           0,45    1195,62    0,14
 7232        1246      8,89      0,47          -0,45           0,48    1259,3    176,94
 7973        1302      8,98      0,49          -0,39           0,5     1320,37    337,4
 8724        1359      9,07      0,51          -0,33           0,52    1377,86   355,72
 9476        1417      9,16      0,54          -0,26           0,54    1431,51   210,51
 10209       1467      9,23      0,56          -0,21           0,56    1480,44   180,52
 10949       1511      9,3       0,57          -0,16           0,58    1526,8    249,71
 11698       1551      9,37      0,59          -0,12           0,59    1570,93   397,28
 12434       1609      9,43      0,61          -0,06           0,61    1611,8     7,85
 13127       1659      9,48      0,63          -0,01           0,62    1648,22   116,27
 13842       1700      9,54      0,64           0,03           0,64    1683,86   260,59
 14607       1743      9,59      0,66           0,07           0,65    1719,99   529,59




 Table 6: Calculations for finding distribution parameters and theoretical values for the
samples from Thomas Mann’s texts. The highest observed volume of vocabulary is 2104
        morphemes. Nmax = 3604, β = 0.72, η = 17676.11, Σ(yi − yj)² = 25460.79.

 Sample size x   Observed value yi   ln(x)   yi/Nmax   ln(−ln(1 − yi/Nmax))   Weibull value   Theoretical value yj   Quadratic deviation
 700      332       6,55      0,09           -2,34          0,09    334,84      8,08
1367      533       7,22      0,15           -1,83          0,15    526,67     40,04
2129      739       7,66      0,21           -1,47          0,2     704,28    1205,4
2826      867       7,95      0,24           -1,29          0,23    843,63    546,03
3537      990       8,17      0,27           -1,14          0,27    969,93      403
4228      1107      8,35      0,31             -1           0,3     1080,9     681,4
4897      1184      8,5       0,33           -0,92          0,33    1179,42      21
5575      1283      8,63      0,36           -0,82          0,35    1271,79   125,72
6308      1369      8,75      0,38           -0,74          0,38    1364,44    20,82
6911      1425      8,84      0,4            -0,69          0,4     1435,76   115,86
7629      1489      8,94      0,41           -0,63          0,42    1515,63   709,07
8328      1566      9,03      0,43           -0,56          0,44    1588,66    513,7
9005      1634      9,11      0,45            -0,5          0,46    1655,43   459,37
9705      1698      9,18      0,47           -0,45          0,48    1720,76   517,99
10414     1747      9,25      0,48           -0,41          0,49    1783,43   1327,29
11161     1824      9,32      0,51           -0,35          0,51     1846     483,93
11883     1900      9,38      0,53           -0,29          0,53    1903,38    11,42
12588     1964      9,44      0,54           -0,24          0,54    1956,71    53,09
13351     2025      9,5       0,56           -0,19          0,56    2011,67   177,77
14085     2104      9,55      0,58           -0,13          0,57    2062,02   1762,4




                 Figure 1: Type-token ratio functions for morphemes.




5    Conclusions
     A logical interpretation of the results is that there are more foreign roots in Thomas
Mann’s short stories than in Franz Kafka’s texts. It can be noticed that Mann uses more
foreign proper names (e.g. Florentinum, Fontana). However, it seems that the type-token ratio
should also be considered at the word level in order to test this hypothesis. For example, an
author whose morphemic vocabulary grows slowly but who uses many different lexemes could be
relying on word formation to support expressivity.
     In the present paper, it has been shown that quantitative aspects of derivational
morphology can be regarded as a feature of individual style. A significant difference has been
found between the growth of morphemic vocabulary in two German authors’ texts. In future
research, larger samples need to be taken in order to compare the dependence of the number
of lexeme and morpheme types on sample size. As word formation is a productive linguistic
and cognitive process, this may considerably contribute to the quantitative research of style.


References
[Zemskaya, 1992] Zemskaya E. (1992) Word Formation as an Activity. (In Rus.) = Slovoobra-
  zovanie kak deyatel’nost’. Nauka, Moscow, Russia. – 221 p.

[Kosareva and Martynenko, 2015] Kosareva E. O., Martynenko G. Ya. (2015) The
  Type–Token Ratio in Everyday Spoken Russian. Structural and Applied Linguistics, Vol.
  11. (In Rus.) = Otnoshenie tekst – slovar’ v povsednevnoy ustnoy rechi. Strukturnaya i
  prikladnaya lingvistika, Vol. 11. Saint Petersburg, Russia. Pp. 220–228

[Grebennikov and Assel, 2019] Grebennikov A. O., Assel A. N. (2019) XIX–XX Centuries’
  Russian Short Stories Corpus. Approximation Models. Proceedings of the International
  Conference «Corpus Linguistics–2019». Saint Petersburg University Press. (In Rus.) =
  Baza russkogo rasskaza XIX–XX vekov. Modeli approksimatsii. Trudy mezhdunarodnoy
  konferentsii «Korpusnaya lingvistika – 2019». Izdatel’stvo Sankt-Peterburgskogo
  Universiteta, Saint Petersburg, Russia. Pp. 379–386

[Mel’čuk, 2006] Mel’čuk I. A. (2006) Aspects of the Theory of Morphology. Trends in Lin-
  guistics. Studies and Monographs 146. Mouton de Gruyter, Berlin, Germany. – 616 p. Available
  at https://anekawarnapendidikan.files.wordpress.com/2014/04/aspects-of-the-theory-of-morphology-by-igor-melcuk.pdf

[Fleischer and Barz, 2012] Fleischer W., Barz I. (2012) Word Formation of Contemporary
  German. (In Ger.) = Wortbildung der deutschen Gegenwartssprache. Fourth, revised edition.
  De Gruyter, Berlin, Germany. – 481 p.

[Grönroos et al., 2014] Grönroos S.-A., Virpioja S., Smit P., Kurimo M. (2014) Morfessor Flat-
  Cat: An HMM-based method for unsupervised and semi-supervised learning of morphology.
  Proceedings of the 25th International Conference on Computational Linguistics. Associa-
  tion for Computational Linguistics, August 2014, Dublin, Ireland. Pp. 1177–1185. Available
  at https://www.aclweb.org/anthology/C14-1111.pdf

[Schmid, 1994] Schmid H. (1994) Probabilistic Part-of-Speech Tagging Using De-
   cision Trees. Proceedings of the International Conference on New Methods in Lan-
   guage Processing, Manchester, UK. Pp. 44–49. Available at https://www.cis.uni-
   muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf

[Schmid, 1995] Schmid H. (1995) Improvements in Part-of-Speech Tagging with an Ap-
   plication to German. Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland. Pp.
   1–9. Available at https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-
   tagger2.pdf

[Wilks and Stevenson, 1997] Wilks Y. and Stevenson M. (1997) Sense Tagging: Semantic
  Tagging with a Lexicon. ACL SIGLEX workshop, Washington, DC. Pp. 74–78

[Ransmayr et al., 2016] Ransmayr J., Schwaiger S., Durco M., Pirker H., Dressler W. U.
  (2016) Grading of the Transparency of Diminutives Ending in -chen: A Corpus Linguistic
  Study. German Language. Journal for Theory, Practice, Documentation 44(3). (In
  Ger.) = Gradierung der Transparenz von Diminutiven auf -chen: Eine korpuslinguistische
  Untersuchung. Deutsche Sprache. Zeitschrift für Theorie, Praxis, Dokumentation 44(3). Pp.
  261–286

[Geyken, 2007] Geyken A. (2007) The DWDS corpus: A reference corpus for the German
  language of the 20th century. In: Collocations and Idioms: Linguistic, lexicographic, and
  computational aspects. Ed. by Fellbaum C. London, UK. Pp. 23–41. Draft available at
  https://www.dwds.de/dwds_static/publications/text/DWDS-Corpus_Desc4_draft.pdf



