Statistical Characteristics of Roman Ivanychuk’s Idiolect (Based on
Writer’s Text Corpus)
Nataliia Lototska

Lviv State University of Life Safety, Kleparivska str. 35, Lviv, 79013, Ukraine


            Abstract
            The paper presents a statistical study of Roman Ivanychuk’s historical novels. His idiolect has not
            yet been the subject of linguistic and statistical research. The analysis of an author’s text and its
            lexicon reflects the individuality of his linguistic preferences.
            Statistical methods make it possible to identify quantitative markers of text and vocabulary and,
            in turn, to give them a qualitative interpretation. A text corpus serves as a useful tool for
            discovering many aspects of language use that might otherwise be left undetected.
            For an integrated study of the writer’s idiolect, corpora of Roman Ivanychuk’s texts and of Ukrainian
            literary prose were created on the basis of GRAC. Its functionality makes it possible to obtain data on
            the frequency of parts of speech and to present various morphological statistical characteristics
            (index of epithetization, index of verb + adverbial complement, level of nominalization), statistical
            parameters of vocabulary and text (text size, size of vocabulary of lexemes, diversity index, average
            word frequency, hapax legomena, exclusivity index of text and vocabulary, concentration index of text
            and vocabulary), and frequency zones of vocabulary.
            The results of this study may be interpreted as features of the author’s individual manner of writing
            and applied to text identification and further research into individual language.

            Keywords
            Frequency, frequency rank, idiolect, statistical characteristics, parts of speech, text corpus.

1. Introduction

    Roman Ivanychuk (1929–2016) is a significant figure in Ukrainian literature of the second half of
the 20th century and the beginning of the 21st century. The writer is known primarily for his historical novels
and short stories. He also wrote numerous other novels, memoirs, interviews, and journalistic texts. Roman
Ivanychuk’s texts have constantly been a subject of interest to literary critics and linguists. However, the
writer’s idiolect has not yet been the subject of thorough linguistic and statistical analysis.
    Research into a writer’s style in linguistics is mainly carried out on his literary texts. The text is the
basis for the study of the idiolect as well. The individuality of the author’s language and manner is reflected
in the preference for certain lexical, morphological, syntactic, and phonetic means in the text [41].
    Representatives of text theory consider an individual text to be a system united by communicative
integrity and by logical, grammatical, and stylistic relations [67]. The specific use of a particular unit in a
certain text determines its functional properties (frequency, position, compatibility), which depend on the
nature of the text and on the functional or authorial style, and vary from text to text [47].
    The author’s lexicon reflects the idiolect most specifically. “The linguistic personality of the writer is
revealed through the individually used word, his artistic and individual picture of the world is reflected in
linguistic expression” [48, p. 11].
    The writer’s speech as a marker of linguistic personality makes it possible to follow his / her manner of
choice and use of words, whereas the language picture of the author’s world as a representative of a
particular linguistic and cultural community is displayed in his literary texts [26].


COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland
EMAIL: nata07lototska@gmail.com (N. Lototska)
ORCID: 0000-0001-6692-196X (N. Lototska)
              ©️ 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
   The author’s text reveals the concept of linguistic personality through which the writer creates a certain
fragment according to the world picture presented in his / her cognition [25].


2. Related works

    The notion of idiolect is well known in linguistics, although its exact definition and the very existence
of the phenomenon are subjects of study and debate. Norbert Dittmar offers a definition of idiolect as the
language of the individual, which because of the acquired habits and the stylistic features of the personality
differs from that of other individuals [14, p. 111].
    Numerous researchers have analyzed the idiolect of certain authors synthesizing various disciplines:
linguostylistics [26, 48, 63, 64], cognitive linguistics [30], quantitative linguistics [5, 6, 7, 13, 17, 19, 27,
32, 33, 41, 44, 51, 60, 61] and literature studies.
    An idiolect is a set of language characteristics of a native speaker; a person’s specific, unique way of
speaking or writing. The past lack of interest in idiolects derives from the difficulty in obtaining appropriate
data and, on a theoretical level, it arises in some cases from a general dismissal of usage as being
uninteresting and in others from an understandable focus on the general rather than the particular [3].
However, the notion of idiolect remains an understudied topic, especially in quantitative linguistics, due to
the insufficiency of relevant large corpora [3, 4, 39].
    Corpus linguistic research offers strong support for the idea that language variation is systematic and
can be described using empirical, quantitative methods [18]. Anatol Stefanowitsch defines a corpus as a
large collection of authentic text (i.e., samples of language produced in genuine communicative situations)
[58]. A corpus is “a collection of pieces of language text in electronic form” [56, p. 19]; moreover, a text
corpus provides useful statistical information such as the number of word types, frequencies, and
co-occurrences [4].
    A writer’s text corpus is an independent corpus that can be part of a general corpus of a certain
language or exist as a separate entity; it enables a detailed study of the writer’s lexicon and opens
prospects for further research. A text corpus has huge potential for the study of a writer’s language in
both qualitative and quantitative aspects; moreover, some facts of a writer’s style, such as vocabulary
richness and indexes of variety and exclusiveness, can be revealed only with the help of a text corpus
[5, 6, 7, 66].
    Nowadays the texts of Taras Shevchenko [63], Hryhoriy Skovoroda [49], Yuri Shevelyov [11, 61],
Mykhailo Kotsyubynsky [60], Bohdan Lepky [57], Ivan Franko [6, 7], Vasyl Stefanyk [19] and others are
studied by means of corpus linguistics.
    Ways of expressing individual style in natural language can be detected automatically with
computational linguistics methods [29], as works of fiction tend to be long and provide large quantities of data
[53]. A corpus-based approach combined with statistical methods makes it possible to solve problems of
authorship attribution, which helps in identifying different types of texts and can be used in plagiarism
detection, author identification, and resolving disputed authorship [12, 15, 21, 22, 35, 36, 37, 40, 50, 62].
    Statistical studies of Ukrainian literary texts have been used to examine their lexical and stylistic
peculiarities. The prospects and relevance of quantitative (and not only interpretive) methods for
literary texts are traced in the following studies [1, 2, 3, 11, 17, 19, 27, 28, 29, 32, 33, 34, 42, 44, 45,
51, 53, 59, 68].
    Ihor Kulchytskiy’s study [28] examines Roman Ivanychuk’s individual style of writing by statistical
means in order to find distinctive features that set his manner apart from that of other Ukrainian
authors. According to Solomiya Buk [5, 6], statistical parameters of Ivan Franko’s dictionary, such as the
ratio of hapax legomena and frequently used lemmas, help to identify and describe more precisely the
important features of the author’s style.
    The study of parts of speech in a writer’s text and dictionary is considered an important stage in
establishing individual features. The peculiarities of the behavior of parts of speech have been revealed in
the texts of different writers [8, 9, 46, 51, 52], in particular in the Ukrainian fiction of Ivan Franko,
Oleksandr Dovzhenko, Hryhir Tyutyunnyk, Yuriy Smolych, Mykhailo Stelmakh, and others.
    An analysis of research on writers’ idiolects shows tendencies towards lexicographic
parameterization, quantitative analysis, and the application of computer technologies, in particular on the
basis of a text corpus [5, 6, 7, 11, 19, 27, 28, 32, 33, 34].

3. Methods and materials

3.1.    Statistical approach in the idiolect study

     The author’s text reflects his / her world view and demonstrates the richness of the lexicon of his / her
 linguistic personality, while the statistical approach makes it possible to identify quantitative markers of
 the idiolect and, in turn, to give them a qualitative interpretation.
     Quantitative analysis in idiolect study helps to avoid methodological mistakes frequently caused
 by the researcher’s subjectivity [32]. Statistical approaches applied to different writers’ texts may reveal
 characteristics that distinguish them from one another and therefore present the individual creative manner
 of a writer [45].
     Yuriy Pavlov and Elizaveta Tikhomirova consider low-frequency vocabulary as an idiolect marker [44,
 p. 9]. Tatiana Demidova adds that the author’s linguistic taste, his literary inclinations and the richness of
 his linguistic personality are reflected in low-frequency vocabulary [13]. Mihail Muhin proposes analyzing
 the writer’s frequency vocabulary to identify the idiostyle features [41].
     Statistics is an important tool for linguistic data analysis in modern linguistics. Quantitative
 methods ensure the reliability of results and reveal the properties of language units and text structure;
 according to [32], no research is possible without statistical studies. The fact that language itself is a
 complex system subject to the laws of statistics proves the necessity of using statistical methods in
 linguistics [46].
     Quantitative characteristics of a text make it possible to determine the qualitative characteristics of
 the writer’s idiolect objectively [33]. It is generally acknowledged that there is an internal
 interdependence between the qualitative and quantitative features of language structure, which subordinates
 the frequency of language units in speech to certain statistical patterns [31, p. 5].
     Statistical methods are widely used in linguistics and have become one of the most efficient and
 time-saving tools for processing different sets of texts [27]. These methods yield accurate data on lexical
 units in context and on the frequency of occurrences, words, lemmas, and grammatical categories. In
 addition, search results can be ranked by different parameters, and threshold values can be set, thus
 making it possible to obtain meaningful information [20, p. 66].
     Statistical studies provide the opportunity to compare the proportions of parts of speech in the writer’s
 texts and to reflect the quantitative characteristics of the writer’s lexicon, which objectively represent
 the stylistic features of the writer at the lexical level, and, conversely, to identify the words that did not
 function in society during his creative activity [5, 6, 7].
     Statistical analysis of Roman Ivanychuk’s historical prose fiction makes it possible to demonstrate the
 author’s individual manner of writing. The topicality of the research lies in the lack of thorough research
 into the idiolect of Roman Ivanychuk’s literary legacy and in the need for an integrated study of his lexical
 system based on a text corpus and modern research methods.

 3.2.    Writer’s text corpus as a tool to reveal statistical characteristics of idiolect

    The use of a text corpus provides reliable criteria for determining the acceptability and evaluating the
 use of certain linguistic phenomena, and yields accurate data on the lexical structure of language and
 the relative frequency of use of lexical items (words) [58].
    The creation of a text corpus involves integrated processing of the writer’s lexicon, which offers an
 opportunity to carry out more advanced and promising studies of his / her literary texts [5, 6, 7].
    The subject of our research is the statistical characteristics of Roman Ivanychuk’s historical novels. To
 carry out such a study, a corpus of his prose fiction texts has been created. This
 corpus comprises 16 historical novels and 1 historical trilogy written between 1962 and 2016 (the total corpus
 size is 1.295 million words): At The Edge Of The Paven Way (Krai bytoho shliakhu), Mallows (Malʹvy),
 Red Wine (Cherlene vyno), Manuscript From Ruska Street (Manuskrypt z vulytsi Rusʹkoyi), Water From
 The Stone (Voda z kameniu), The Fourth Dimension (Chetvertyi vymir), Scars On The Rock (Shramy na
 skali), Crane’s Cry (Zhuravlynyi kryk), Because War Is War (Bo viyna viynoyu), Horde (Orda), The Gospel
 Of Thomas (Yevanheliye vid Tomy), Pillars Of Fire (Vohnenni stovpy), Saxaul In The Sands (Saksaul u
 piskakh), Across The Pass (Cherez pereval), Pilgrimage (Khresna proshcha), Voices From Above The
 Waters Of Kinneret (Holosy z-nad vod Henisareta), I Have Not Written About Donbass Yet (Ya shche ne
 pysav pro Donbas).
   The texts of the novels were converted into electronic form; the next step was the normalization of
the texts in the MS Word editor [27]. “Text normalization process contains the following stages:
normalization of coding, normalization of graphics, text proofreading, technical normalization of
punctuation” [27, p. 58].
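
Some of the normalization stages quoted above can be approximated in code. The following is only a minimal sketch, not the procedure used in the study (which was carried out in MS Word): the function name `normalize_text`, the set of apostrophe characters, and the whitespace rules are assumptions made for illustration, and the proofreading stage remains manual.

```python
import re
import unicodedata

# Assumption of this sketch: these characters are all treated as variants
# of one apostrophe (right single quote, modifier apostrophe, backtick, acute).
APOSTROPHE_VARIANTS = "\u2019\u02BC\u0060\u00B4"

def normalize_text(raw: str) -> str:
    """Apply simple encoding, graphics, and punctuation normalization."""
    # Normalization of coding: bring the text to one canonical Unicode form.
    text = unicodedata.normalize("NFC", raw)
    # Normalization of graphics: use a single apostrophe character.
    for variant in APOSTROPHE_VARIANTS:
        text = text.replace(variant, "'")
    # Remove soft hyphens that often survive conversion from print layouts.
    text = text.replace("\u00AD", "")
    # Technical normalization of punctuation: collapse runs of spaces or tabs
    # and drop spaces placed before punctuation marks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    return text.strip()
```

For instance, `normalize_text("п\u2019ять  слів ,  так")` returns `"п'ять слів, так"`.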
   The next step was to upload these texts into GRAC [55] and create Roman Ivanychuk’s subcorpus
(RITC). GRAC makes it possible to search for any linguistic phenomenon through the NoSketchEngine interface,
which enables search by lemma, word form, and grammatical tags, visualization of frequencies as a
concordance, and customization of text filters (texts of a given period, style, original language, etc.) [54].
GRAC’s functionality also provides automatically retrieved information about word forms, lemmas,
parts of speech, their frequency, etc. “GRAC is intended to be a universal tool for a wide range of
research questions” [55].

4. Experiment and results

4.1. Statistical characteristics of Roman Ivanychuk’s lexicon as an idiolect
marker

    For multifaceted idiolect study linguists make a quantitative description of writers’ texts, which provides
accurate information about the peculiarities of vocabulary functioning in these texts [7, p. 86].
    To carry out integrated research of the author’s idiolect, the subcorpus of Ukrainian prose fiction
(UPFTC) was created in the GRAC by applying filters like style Fiction (DOC.STYLE — FIC), original
language Ukrainian (DOC.ORIGINAL — UK), time span (DOC.DATE — 1960–2016).
    Roman Ivanychuk’s subcorpus data are compared with the data of the Ukrainian prose fiction text corpus
for the period 1960–2016 to reveal the peculiarities of Roman Ivanychuk’s idiolect.
    Analysis of the frequency of parts of speech, vocabulary ranking, and the calculation of morphological
and statistical indicators for the vocabulary and text display the frequency patterns of the idiolect,
which supports authorship attribution of the text and its further automatic processing.
    The study of part-of-speech frequency in the text is essential for revealing a writer’s individual
characteristics [46, p. 186]. To obtain the statistical characteristics of nouns, adjectives, verbs, adverbs,
and prepositions in Roman Ivanychuk’s texts, CQL queries are used (other parts of speech are not studied
because of problems with resolving homonymy). The relative frequencies of these parts of speech in Roman
Ivanychuk’s subcorpus have been calculated and are presented in Table 1.

Table 1
The frequency of parts of speech in RITC
 №    Text title                                                                 Noun    Adjective   Verb    Adverb   Preposition
 1    At The Edge of the Paven Way (Krai bytoho shliakhu)                        38,7    14,7        19,1    10,4     10,4
 2    Mallows (Malʹvy)                                                           40,8    16,4        18,1    8,8      11,1
 3    Red Wine (Cherlene vyno)                                                   40,5    15,4        17,4    8,1      11,5
 4    Manuscript from Ruska Street (Manuskrypt z vulytsi Rusʹkoyi)               39,8    15,9        17,7    8,9      10,7
 5    Water from the Stone (Voda z kameniu)                                      39      15,9        17,8    9,1      10,9
 6    The Fourth Dimension (Chetvertyi vymir)                                    41,5    16,8        16,9    9,4      10,8
 7    Scars on the Rock (Shramy na skali)                                        41      16,9        17,2    8,8      10,9
 8    Crane's Cry (Zhuravlynyi kryk)                                             40,9    16,3        17,7    9        10,8
 9    Because War Is War (Bo viyna viynoyu)                                      40,9    16,2        17,4    9,1      11,7
 10   Horde (Orda)                                                               42,5    16,7        17,4    8,5      11,3
 11   The Gospel of Thomas (Yevanheliye vid Tomy)                                42,3    18,3        16,6    8,1      11,7
 12   Saxaul in The Sands (Saksaul u piskakh)                                    40,2    16,5        17,1    9,3      11,5
 13   Pillars of Fire (Vohnenni stovpy)                                          39,5    15,9        17,8    9,5      12,1
 14   Across The Pass (Cherez pereval)                                           40,2    17,8        17,1    9,3      11,4
 15   Pilgrimage (Khresna proshcha)                                              41,8    18,5        16,7    8,8      12,2
 16   Voices from above The Waters of Kinneret (Holosy z-nad vod Henisareta)     42,6    19,1        15,9    7,3      12,6
 17   I Have Not Written About Donbass Yet (Ya shche ne pysav pro Donbas)        41,4    18,2        15,8    8,2      11,9

    The data in Table 1 show that the relative frequency of nouns in RITC ranges from 38,7 (Krai bytoho
shliakhu) to 42,6 (Holosy z-nad vod Henisareta), of adjectives from 14,7 (Krai bytoho shliakhu) to 19,1
(Holosy z-nad vod Henisareta), of verbs from 15,8 (Ya shche ne pysav pro Donbas) to 19,1 (Krai bytoho
shliakhu), of adverbs from 7,3 (Holosy z-nad vod Henisareta) to 10,4 (Krai bytoho shliakhu), and of
prepositions from 10,4 (Krai bytoho shliakhu) to 12,6 (Holosy z-nad vod Henisareta). In his early
works the frequency of nouns and adjectives is lower, and the frequency of verbs and adverbs higher, than
in the texts of the later period.
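
The ranges can be recomputed directly from Table 1. The sketch below is an illustration only: the dictionary `TABLE_1` transcribes the table values with decimal points instead of decimal commas, and the helper name `pos_ranges` is hypothetical.

```python
# Relative frequencies transcribed from Table 1.
# Column order: noun, adjective, verb, adverb, preposition.
POS = ("noun", "adjective", "verb", "adverb", "preposition")

TABLE_1 = {
    "Krai bytoho shliakhu": (38.7, 14.7, 19.1, 10.4, 10.4),
    "Malvy": (40.8, 16.4, 18.1, 8.8, 11.1),
    "Cherlene vyno": (40.5, 15.4, 17.4, 8.1, 11.5),
    "Manuskrypt z vulytsi Ruskoyi": (39.8, 15.9, 17.7, 8.9, 10.7),
    "Voda z kameniu": (39.0, 15.9, 17.8, 9.1, 10.9),
    "Chetvertyi vymir": (41.5, 16.8, 16.9, 9.4, 10.8),
    "Shramy na skali": (41.0, 16.9, 17.2, 8.8, 10.9),
    "Zhuravlynyi kryk": (40.9, 16.3, 17.7, 9.0, 10.8),
    "Bo viyna viynoyu": (40.9, 16.2, 17.4, 9.1, 11.7),
    "Orda": (42.5, 16.7, 17.4, 8.5, 11.3),
    "Yevanheliye vid Tomy": (42.3, 18.3, 16.6, 8.1, 11.7),
    "Saksaul u piskakh": (40.2, 16.5, 17.1, 9.3, 11.5),
    "Vohnenni stovpy": (39.5, 15.9, 17.8, 9.5, 12.1),
    "Cherez pereval": (40.2, 17.8, 17.1, 9.3, 11.4),
    "Khresna proshcha": (41.8, 18.5, 16.7, 8.8, 12.2),
    "Holosy z-nad vod Henisareta": (42.6, 19.1, 15.9, 7.3, 12.6),
    "Ya shche ne pysav pro Donbas": (41.4, 18.2, 15.8, 8.2, 11.9),
}

def pos_ranges(table):
    """Return {part of speech: ((min value, title), (max value, title))}."""
    ranges = {}
    for index, pos in enumerate(POS):
        column = [(values[index], title) for title, values in table.items()]
        ranges[pos] = (min(column), max(column))
    return ranges

for pos, (lowest, highest) in pos_ranges(TABLE_1).items():
    print(f"{pos}: {lowest[0]} ({lowest[1]}) to {highest[0]} ({highest[1]})")
```

Keeping the title next to each value makes it easy to see which novel sits at each end of a range.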
    The data for Roman Ivanychuk’s subcorpus and for the subcorpus of Ukrainian prose fiction are
compared in Table 2.

Table 2
The frequency of parts of speech in RITC and UPFTC
                  Part of speech            RITC                   UPFTC
                  Noun                       40,1                  38,5
                  Adjective                  16,6                  16
                  Verb                       17,5                  17,7
                  Adverb                     9                     11,3
                  Preposition                11,4                  10,1

   The frequency of the parts of speech in RITC correlates with the data retrieved from UPFTC; however, the
frequency of nouns in RITC is higher by ≈ 5% and that of prepositions by ≈ 11% than in UPFTC, whereas the
frequency of adverbs is lower by ≈ 20%.
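
The direction and approximate size of these differences can be checked against Table 2. This is a sketch under two assumptions: the UPFTC value is taken as the base of the percentage, and the inputs are the rounded values printed in the table, so the results may deviate slightly from the percentages quoted in the text. The names `TABLE_2` and `relative_difference` are hypothetical.

```python
# (RITC, UPFTC) relative frequencies transcribed from Table 2.
TABLE_2 = {
    "noun": (40.1, 38.5),
    "adjective": (16.6, 16.0),
    "verb": (17.5, 17.7),
    "adverb": (9.0, 11.3),
    "preposition": (11.4, 10.1),
}

def relative_difference(ritc: float, upftc: float) -> float:
    """Difference of RITC from UPFTC in percent of the UPFTC value (assumed base)."""
    return (ritc - upftc) / upftc * 100

differences = {pos: relative_difference(*pair) for pos, pair in TABLE_2.items()}

for pos, diff in differences.items():
    direction = "higher" if diff > 0 else "lower"
    print(f"{pos}: {abs(diff):.1f}% {direction} in RITC")
```

A positive value means the part of speech is more frequent in RITC than in UPFTC; a negative value means it is less frequent.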


[Bar chart: RITC and UPFTC frequencies for Noun, Adjective, Verb, Adverb, and Preposition; values as in Table 2]

Figure 1: The frequency of parts of speech in RITC and UPFTC

    At the morphological level, Roman Ivanychuk’s idiolect is described by morphological statistical
characteristics: the index of epithetization (the ratio of total noun occurrences to total
adjective occurrences), the index of verbal definitions (the ratio of total adverb occurrences to
total verb occurrences), and the level of nominalization (the ratio of total noun occurrences to total
verb occurrences) [5, 6]. The data obtained are presented in Table 3.

Table 3
Morphological statistical characteristics in RITC and UPFTC
 Morphological statistical characteristics    RITC        UPFTC
 index of epithetization                      2,44645     2,41239
 index of verbal definitions                  0,517969    0,642332
 level of nominalization                      2,319452    2,181664

    The data in Table 3 indicate that the number of nouns per adjective in RITC is slightly higher than in
UPFTC. The ratio of adverbs to verbs is lower in RITC than in UPFTC, so verbs are accompanied by adverbs
less often, while the level of nominalization is higher. These data serve as an idiolect peculiarity.
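
The three indexes are simple ratios of part-of-speech totals. The sketch below computes them from the rounded relative frequencies of Table 2, which is an assumption of this illustration: the published values in Table 3 were presumably obtained from unrounded counts, so the figures agree only approximately. The function name `morphological_indices` is hypothetical.

```python
def morphological_indices(noun: float, adjective: float,
                          verb: float, adverb: float) -> dict:
    """Compute the three morphological indexes from part-of-speech totals."""
    return {
        "epithetization": noun / adjective,    # nouns per adjective
        "verbal_definitions": adverb / verb,   # adverbs per verb
        "nominalization": noun / verb,         # nouns per verb
    }

# Inputs: rounded relative frequencies from Table 2 (not raw counts).
ritc = morphological_indices(noun=40.1, adjective=16.6, verb=17.5, adverb=9.0)
upftc = morphological_indices(noun=38.5, adjective=16.0, verb=17.7, adverb=11.3)

for name in ritc:
    print(f"{name}: RITC {ritc[name]:.3f}, UPFTC {upftc[name]:.3f}")
```

Because each index is a ratio, relative frequencies can stand in for absolute counts as long as both come from the same corpus.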
    Yuhan Tuldava considers the vocabulary of a text as a system and proposes studying it by means of
“quantitative” mathematics, which in turn makes it possible to identify and comprehend its systemic
properties [38]. Galina Napreenko [42] suggests that word frequency in a text is an identification parameter
for determining its authorship. The frequency of a lexical unit is an important characteristic of a word, as
it indicates how actively the word functions in the text and its weight in the statistical structure of the text.
    In this study the statistical parameters of the author’s vocabulary and text are the following: text size
(N), the total number of words in the text; size of vocabulary of lexemes (V), the number of lemmas in the
text; diversity index (V/N), the ratio of the size of vocabulary of lexemes (V) to text size (N); average word
frequency in the text (N/V), the ratio of text size (N) to the size of vocabulary of lexemes (V); hapax
legomena (V1), the number of lexemes with frequency 1; exclusivity index of vocabulary (V1/V), the ratio of
the number of lexemes with frequency 1 (V1) to the total number of lemmas (V); exclusivity index of text
(V1/N), the ratio of the number of lexemes with frequency 1 (V1) to text size (N); concentration index of
vocabulary (V10/V), the ratio of the number of lexemes with frequency 10 and more (V10) to the size of
vocabulary of lexemes (V); concentration index of text (V10t/N), the ratio of the number of word occurrences
of lexemes with frequency 10 and more (V10t) to text size (N) [5, 6].
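
The definitions above translate directly into code once a text is available as a list of lemmas. The following sketch assumes that lemmatization has already been done (in the study this information comes from GRAC); the function name `vocabulary_statistics` and the toy input are hypothetical.

```python
from collections import Counter

def vocabulary_statistics(lemmas):
    """Compute the statistical parameters of vocabulary and text from a lemma list."""
    frequencies = Counter(lemmas)
    n = sum(frequencies.values())        # text size N
    v = len(frequencies)                 # size of vocabulary of lexemes V
    if n == 0:
        raise ValueError("the lemma list is empty")
    v1 = sum(1 for count in frequencies.values() if count == 1)     # hapax legomena
    v10 = sum(1 for count in frequencies.values() if count >= 10)   # lexemes with f >= 10
    v10t = sum(count for count in frequencies.values() if count >= 10)  # their occurrences
    return {
        "N": n,
        "V": v,
        "diversity (V/N)": v / n,
        "average frequency (N/V)": n / v,
        "V1": v1,
        "exclusivity of vocabulary (V1/V)": v1 / v,
        "exclusivity of text (V1/N)": v1 / n,
        "concentration of vocabulary (V10/V)": v10 / v,
        "concentration of text (V10t/N)": v10t / n,
    }

# Toy example: one frequent lemma, one medium-frequency lemma, two hapaxes.
toy_lemmas = ["i"] * 10 + ["ne"] * 3 + ["slovo", "mova"]
stats = vocabulary_statistics(toy_lemmas)
for name, value in stats.items():
    print(name, value)
```

In the toy example N = 15 and V = 4, so the diversity index is 4/15 and the average word frequency is 3,75; two of the four lemmas are hapax legomena, which gives an exclusivity index of vocabulary of 0,5.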
    The analysis of RITC and UPFTC yielded data on the vocabulary of the texts under study, which are
presented in Table 3.

Table 3
Statistical parameters in RITC
 Columns: N, text size; V, size of vocabulary of lexemes; V/N, diversity index; N/V, average word frequency;
 V1, hapax legomena; V1/N, exclusivity index of text; V1/V, exclusivity index of vocabulary;
 V10t/N, concentration index of text; V10/V, concentration index of vocabulary.

 №    Text title                                                                 N        V       V/N     N/V   V1     V1/N    V1/V    V10t/N   V10/V
 1    At The Edge of the Paven Way (Krai bytoho shliakhu)                        119231   13231   0,111   9     6133   0,051   0,464   0,757    0,108
 2    Mallows (Malʹvy)                                                           69386    10411   0,15    6,7   5247   0,076   0,504   0,715    0,090
 3    Red Wine (Cherlene vyno)                                                   46960    8723    0,186   5,4   4784   0,102   0,548   0,641    0,075
 4    Manuscript from Ruska Street (Manuskrypt z vulytsi Rusʹkoyi)               61715    9864    0,159   6,3   5104   0,083   0,517   0,672    0,086
 5    Water from the Stone (Voda z kameniu)                                      69148    11569   0,167   5,9   5983   0,087   0,517   0,677    0,079
 6    The Fourth Dimension (Chetvertyi vymir)                                    60693    10745   0,177   5,6   5748   0,095   0,535   0,687    0,075
 7    Scars on the Rock (Shramy na skali)                                        69456    12120   0,174   5,7   6432   0,093   0,531   0,679    0,072
 8    Crane's Cry (Zhuravlynyi kryk)                                             125383   16278   0,129   7,7   7645   0,061   0,470   0,743    0,103
 9    Because War Is War (Bo viyna viynoyu)                                      71317    12128   0,17    5,9   6385   0,090   0,526   0,682    0,078
 10   Horde (Orda)                                                               59715    10326   0,173   5,8   5465   0,092   0,529   0,872    0,075
 11   The Gospel of Thomas (Yevanheliye vid Tomy)                                92015    13118   0,142   7     6416   0,070   0,489   0,773    0,095
 12   Saxaul in The Sands (Saksaul u piskakh)                                    62087    11207   0,181   5,5   6048   0,097   0,540   0,671    0,073
 13   Pillars of Fire (Vohnenni stovpy)                                          143849   16899   0,117   8,5   7744   0,054   0,458   0,781    0,110
 14   Across The Pass (Cherez pereval)                                           50943    10278   0,201   4,9   5772   0,113   0,562   0,638    0,064
 15   Pilgrimage (Khresna proshcha)                                              89272    13995   0,156   6,4   6977   0,078   0,499   0,708    0,087
 16   Voices from above The Waters of Kinneret (Holosy z-nad vod Henisareta)     34223    8505    0,248   4     4868   0,142   0,572   0,624    0,053
 17   I Have Not Written About Donbass Yet (Ya shche ne pysav pro Donbas)        9612     3306    0,343   2,9   2188   0,228   0,662   0,511    0,033


    The novel Vohnenni stovpy has the largest text size (143849), the largest vocabulary of lexemes
(16899), the greatest number of hapax legomena (7744), and the highest concentration index of vocabulary
(0,110). The novel Ya shche ne pysav pro Donbas, by contrast, has the smallest text size (9612), the
smallest vocabulary of lexemes (3306), the fewest hapax legomena (2188), and the lowest concentration
index of vocabulary (0,033). At the same time, Ya shche ne pysav pro Donbas shows the highest diversity
index (0,343) and the highest exclusivity index of vocabulary (0,662), whereas Vohnenni stovpy shows the
lowest diversity index (0,117) and the lowest exclusivity index of vocabulary (0,458).
    The diversity index falls as text length grows: the longer a text is, the smaller the share of unique
words it tends to contain [46, p. 143]. Hapax legomena usually account for 40-60% of a vocabulary
[24, p. 72]. In Roman Ivanychuk’s texts the share of hapax legomena in the vocabulary (V1/V) varies between
46% and 66%. The concentration index of vocabulary moves in the opposite direction to the exclusivity index
of vocabulary, which the RITC data confirm.
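The relations described above can be checked against the table rows shown here. The following Python sketch is illustrative only: it recomputes the diversity index (V/N) and the share of hapax legomena in the vocabulary (V1/V) from the N, V and V1 values of rows 9-17. The function and variable names are assumptions made for this example and are not part of the study's tooling.

```python
# (transliterated title, N, V, V1) copied from rows 9-17 of the table above
ROWS = [
    ("Bo viyna viynoyu", 71317, 12128, 6385),
    ("Orda", 59715, 10326, 5465),
    ("Yevanheliye vid Tomy", 92015, 13118, 6416),
    ("Saksaul u piskakh", 62087, 11207, 6048),
    ("Vohnenni stovpy", 143849, 16899, 7744),
    ("Cherez pereval", 50943, 10278, 5772),
    ("Khresna proshcha", 89272, 13995, 6977),
    ("Holosy z-nad vod Henisareta", 34223, 8505, 4868),
    ("Ya shche ne pysav pro Donbas", 9612, 3306, 2188),
]


def derived_indices(n, v, v1):
    """Return the diversity index V/N and the hapax share V1/V."""
    return {"diversity": v / n, "hapax_share": v1 / v}


table = {title: derived_indices(n, v, v1) for title, n, v, v1 in ROWS}

# Longest and shortest text by text size N
longest = max(ROWS, key=lambda row: row[1])[0]
shortest = min(ROWS, key=lambda row: row[1])[0]

# Texts with the lowest and highest diversity index
least_diverse = min(table, key=lambda t: table[t]["diversity"])
most_diverse = max(table, key=lambda t: table[t]["diversity"])

hapax_shares = [entry["hapax_share"] for entry in table.values()]

for title, entry in table.items():
    print(f"{title:30s} V/N = {entry['diversity']:.3f}  "
          f"V1/V = {entry['hapax_share']:.3f}")
```

For these nine rows the longest text is also the least diverse and the shortest text the most diverse. Recomputed values may differ from the printed ones in the last digit because of rounding.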
    To study the statistical structure of Roman Ivanychuk’s texts, the data of RITC and UPFTC were compared
(see Table 4).

Table 4
Statistical parameters in RITC and UPFTC
    Statistical characteristics                     RITC           UPFTC
    Text size (N)                                   1235014        76744330
    Size of vocabulary of lexemes (V)               49828          288755
    Diversity index (V/N)                           0,040          0,004
    Average word frequency (N/V)                    24,8           265,8
    Hapax legomena (V1)                             16540          102725
    Exclusivity index of text (V1/N)                0,013          0,001
    Exclusivity index of vocabulary (V1/V)          0,332          0,356
    Concentration index of text (V10t/N)            0,812          0,912
    Concentration index of vocabulary (V10/V)       0,284          0,346

    GRAC’s functionality makes it possible to obtain the number of word tokens with frequency 10 and more
(V10t = 1002836) and the number of lexemes with frequency 10 and more (V10 = 14178). In Roman Ivanychuk’s
texts hapax legomena (V1) amount to 16540 words, so exclusive vocabulary (33%) outweighs high-frequency
vocabulary (28%) in the author’s lexicon. These data indicate the diversity and richness of Roman
Ivanychuk’s vocabulary.
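As an illustration of how these parameters relate to one another, the Python sketch below derives the raw counts (N, V, V1, V10t, V10) from a list of lemmas and turns them into the ratio-based indices. It then applies the same ratios to the RITC counts reported above. The helper names `counts_from_lemmas` and `indices` are hypothetical; the study itself obtained the counts through GRAC, not through this code.

```python
from collections import Counter


def counts_from_lemmas(lemmas, threshold=10):
    """Derive N, V, V1, V10t and V10 from a list of lemmas (one per token)."""
    freq = Counter(lemmas)
    return {
        "N": len(lemmas),                                         # text size
        "V": len(freq),                                           # vocabulary size
        "V1": sum(1 for c in freq.values() if c == 1),            # hapax legomena
        "V10t": sum(c for c in freq.values() if c >= threshold),  # tokens, freq >= 10
        "V10": sum(1 for c in freq.values() if c >= threshold),   # lexemes, freq >= 10
    }


def indices(c):
    """Turn raw counts into the ratio-based parameters used in the text."""
    return {
        "diversity": c["V"] / c["N"],
        "avg_frequency": c["N"] / c["V"],
        "exclusivity_text": c["V1"] / c["N"],
        "exclusivity_vocab": c["V1"] / c["V"],
        "concentration_text": c["V10t"] / c["N"],
        "concentration_vocab": c["V10"] / c["V"],
    }


# Counts reported for RITC in the text above.
RITC = {"N": 1235014, "V": 49828, "V1": 16540, "V10t": 1002836, "V10": 14178}
ritc_indices = indices(RITC)

# Hypothetical toy input: 13 tokens, 3 lemmas.
toy_counts = counts_from_lemmas(["i"] * 10 + ["ty"] * 2 + ["vin"])

for name, value in ritc_indices.items():
    print(f"{name:20s} {value:.3f}")
```

The recomputed RITC ratios agree with the reported values to within rounding in the last digit.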
    In UPFTC, words with frequency 10 and more cover 70051172 word tokens, lexemes with frequency 10 and
more number 100050 lemmas, and hapax legomena (V1) number 102725 words. Here, too, exclusive vocabulary
(36%) outweighs high-frequency vocabulary (34%). The share of high-frequency vocabulary is higher in the
Ukrainian prose fiction text corpus because the corpus is 62 times larger and its diversity index is 10
times lower than in Roman Ivanychuk’s text corpus, which is consistent with the outweigh law.
    It is known that speakers give preference to a small number of high-frequency units [43]. These units
form the core of any speech subsystem, while most units have low frequency [37]. This regularity was
noticed by Dewey and called the outweigh law; it was later studied further by the American linguist
G. K. Zipf, who formulated Zipf’s law, which relates the frequency of a word to its rank [18].
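In its usual textbook form (given here as general background, not as a formula taken from the study), Zipf's law can be written as follows. Here f(r) is the frequency of the word with rank r in the frequency list, and C and α are corpus-dependent constants, with α often close to 1; the symbols are chosen for this illustration only.

```latex
f(r) \approx \frac{C}{r^{\alpha}}, \qquad \alpha \approx 1
```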
    It should be noted that the larger a text corpus is, the more informative it is. In RITC the average
word frequency is about 25, that is, each word is used about 25 times on average. The relatively small
share of high-frequency vocabulary (hence a low concentration index) and the relatively large share of
words with frequency 1 (hence a high exclusivity index) indicate a great diversity of vocabulary in
Roman Ivanychuk’s texts.
    Frequency ranking of vocabulary provides information about the core and periphery of a writer’s
vocabulary. For this reason, one stage of our study is the creation of frequency dictionaries based on
RITC and UPFTC in order to detect the frequency zones. Yuhan Tuldava pointed out [62, p. 65] that the
stratification of vocabulary by frequency (that is, identifying frequency zones of words) is important for
determining text complexity, compiling a minimum dictionary, and processing text automatically.
    Frequency-rank patterns represent the text structure and are interpreted as manifestations of the
individual preferences of a linguistic personality in the choice and use of certain lexical units [34].
The vocabulary of Roman Ivanychuk’s texts and of Ukrainian prose fiction texts is divided into four zones
according to the interval of ranks (see Table 5).

Table 5
Frequency zones of vocabulary in RITC and UPFTC
                           RITC                                              UPFTC
         Frequency zone      Number of      Coverage,    Frequency zone      Number of      Coverage,
                                words           %                               words           %
  1      more than 1000      136            0,4%         more than 1000      7 331          2,4%
  2      100 — 999           1 301          2,6%         100 — 999           28 402         9,6%
  3      10 — 99             10 439         21%          10 — 99             64 329         22%
  4      1—9                 37 963         76%          1—9                 188 705        65%
[Figure 2: horizontal bar chart of coverage (0-80%) for frequency zones 1-4, with one bar each for UPFTC and RITC]

Figure 2: Frequency zones of vocabulary and their coverage in RITC and UPFTC

    As Table 5 and Figure 2 show, the fourth zone is the largest in RITC (76%) in terms of content words,
and hapax legomena cover 33% of the author’s vocabulary; these data manifest the uniqueness of the
author’s idiolect. The second zone includes the largest number of common (general) words, the third
contains words specific to literary style, and the first zone consists of function words and uninformative
common (general) words, which serve as formal markers for text attribution [34].
    Coverage by high-frequency vocabulary is considerably lower in Roman Ivanychuk’s texts than in
Ukrainian prose fiction texts, which makes his texts distinctive. A specific feature of the frequency
zones in RITC is thus higher coverage by low-frequency vocabulary and lower coverage by high-frequency
vocabulary than in UPFTC.
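The zoning in Table 5 can be sketched in Python as a simple binning of a frequency dictionary. This is a minimal illustration under two assumptions: a frequency of exactly 1000 is assigned to the first zone (which the table labels "more than 1000"), and coverage is computed as the share of lexemes that fall into a zone. The function name `frequency_zones` and the toy dictionary are hypothetical.

```python
# Zone boundaries follow Table 5; treating exactly 1000 as zone 1 is an assumption.
ZONES = [
    ("zone 1", 1000, None),
    ("zone 2", 100, 999),
    ("zone 3", 10, 99),
    ("zone 4", 1, 9),
]


def frequency_zones(freq):
    """Map each zone to (number of lexemes, share of the vocabulary in %)."""
    if not freq:
        raise ValueError("frequency dictionary is empty")
    total = len(freq)
    zones = {}
    for name, low, high in ZONES:
        count = sum(
            1 for c in freq.values()
            if c >= low and (high is None or c <= high)
        )
        zones[name] = (count, 100 * count / total)
    return zones


# Hypothetical toy frequency dictionary (lemma -> absolute frequency).
toy = {"i": 1500, "ne": 250, "kniaz": 40, "mech": 12,
       "saksaul": 3, "orda": 2, "stovp": 1, "pereval": 1}
toy_zones = frequency_zones(toy)

for name, (count, share) in toy_zones.items():
    print(f"{name}: {count} lexemes, {share:.1f}% of the vocabulary")
```

Applied to a full frequency dictionary of a corpus, the same function would yield the per-zone counts and shares that a table of this kind reports.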

5.      Discussion and conclusion

    In this study, the Roman Ivanychuk and Ukrainian prose fiction subcorpora built on GRAC make it
possible to reveal and compare quantitative characteristics and qualitative indicators of the author’s
lexicon. The author’s idiolect is studied from the point of view of its lexical arsenal by means of
statistical parameters that make it possible to extract lexical markers of the idiolect and to identify
his texts among others.
    GRAC’s functionality makes it possible to obtain information on the number of word usages, word forms,
and lemmas automatically, and to compile frequency dictionaries of RITC and UPFTC as well.
    A corpus-based dictionary offers an entirely new and much richer type of information, makes it
possible to compare one person’s vocabulary with that of another, and helps to solve different problems.
Methodologically, having more dictionaries of this type from various time periods offers a chance to
study idiolects in a principled and objective way and to follow their development through time [9].
    As a result of the research, the distribution and frequency of parts of speech, concordance,
morphological statistical characteristics of vocabulary (index of epithetization, index of verbal
definitions, level of nominalization), statistical parameters of vocabulary and text (text size, size of
vocabulary of lexemes, diversity index, average word frequency, hapax legomena, exclusivity index of text
and vocabulary, concentration index of text and vocabulary), and frequency zones of vocabulary are
presented.
    The statistical characteristics of text and vocabulary make it possible to determine the qualitative
features of the writer’s idiolect objectively. The quantitative relations between parts of speech are an
important element of statistical text characteristics. Frequency-rank regularities represent text
structure and can be interpreted as manifestations of the individual preferences of a linguistic
personality in the choice of certain lexical units.
    The data obtained on parts of speech, text structure, vocabulary, and frequency zones in Roman
Ivanychuk’s text corpus differ from those in the Ukrainian prose fiction text corpus, which in turn
demonstrates the specificity of the author’s idiolect. The practical results of the study can be applied
to text identification and to further research on the writer’s individual language. This type of study
can be used not only for idiolect investigations but also as data in other contexts, such as authorship
attribution, stylometric studies, and literary research.

6.      References

[1] P. Baker, Using Corpora in Discourse Analysis, A&C Black, 2006, 197 p.
[2] M. Barlow, Individual usage: a corpus-based study of idiolects, in: Proceedings of LAUD Conference.
2010.
[3] M. Barlow, Individual usage: a corpus-based study of idiolects, University of Auckland. International
Journal of Corpus Linguistics 18(4), 2013. doi:10.1075/ijcl.18.4.01bar
[4] D. Biber, S. Conrad, Register, genre, and style, Cambridge University Press (2009) 344 p.
[5] S. Buk, Distinguishing Quantitative Parameters of Author’s Language and Style (A Case of Ivan Franko
Long Prose Fiction), Visnyk Lvivskoho universytetu, Seriia filolohichna, Vyp. 70 (2019) 299–308. [S. Buk,
Distinguishing Quantitative Parameters of Author’s Language and Style (A Case of Ivan Franko Long Prose
Fiction), Bulletin of Lviv University, Philological Series, Vol. 70 (2019) 299–308]
[6] S. Buk, Quantitative analysis of the novel Ne Spytavńy Brodu by Ivan Franko in the Light of Statistical
and Quantitative Linguistics, Speech and context, International Journal of Linguistics, Semiotics, and
Literary Science, Vol. 1(VI) (2014) 100–112.
[7] S. Buk, Suchasni metody doslidzhennia movy pysmennyka u slovianoznavstvi, Problemy
slovianoznavstva, Vyp. 61 (2012) 86–95. [S. Buk, Modern methods of studying the writer’s language in
Slavic studies, Problems of Slavic studies, Vol. 61 (2012) 86–95]
[8] F. Čermák, Slovník Karla Čapka, Praha, Nakladatelství Lidové noviny, Ústav Českého národního
korpusu, 2007, 715 s.
[9] F. Čermák, V. Cvrček. Author Dictionaries Revisited: Dictionary of Bohumil Hrabal, Institute of the
Czech National Corpus, Charles University Prague (2010) 592–598.
[10] Yu. O. Danchevska, I. M. Kulchytskyi, Deiaki aspekty stvorennia korpusu khudozhnikh tvoriv V. S.
Stefanyka, MegaLing–2012 «Prykladna linhvistyka ta linhvistychni tekhnolohii», Kyiv, 2013, ss. 143–149.
[Yu. O. Danchevska, I. M. Kulchytskyi, Some aspects in creation of the corpus of V. S. Stefanyk’s literary
texts, in: Proceedings of MegaLing-2012 "Applied Linguistics and Linguistic Technologies", Kyiv, 2013,
pp. 143–149]
[11] I. Danyliuk, A. Zahnitko, H. Sytar, Korpus tekstiv Yuriia Shevelova: struktura, funktsii, navihatsiia,
Mova: klasychne – moderne – postmoderne, Vyp. 5 (2019) 158–169. URL:
http://nbuv.gov.ua/UJRN/Langcmp_2019_5_14 [I. Danyliuk, A. Zahnitko, H. Sytar, Yuri Shevelyov’s
Corpus Texts: structure, functions, navigation, Language: classical - modern - postmodern, Vol. 5 (2019)
158–169. URL: http://nbuv.gov.ua/UJRN/Langcmp_2019_5_14]
[12] M. Darwich, S. A. Mohd, N. Omar, N. A. Osman, Corpus-Based Techniques for Sentiment Lexicon
Generation: A Review, J. Digit. Inf. Manag., 17(5), 2019, 296 p.
[13] T. D. Demidova, Periferiynaya chast leksikona kak pokazatel literaturnoy maneryi pisatelya (na
materiale liricheskih miniatyur V.P. Detkova), Vestnik LGU im. A.S. Pushkina, № 2, 2011. URL:
http://cyberleninka.ru/article/n/periferiynaya-chast-leksikona-kak-pokazatel-literaturnoy-manerypisatelya-
na-materiale-liricheskih-miniatyur-v-p-detkova [T. D. Demidova, Peripheral part of the lexicon as an
indicator of the literary manner of writer (based on lyrical miniatures of V. P. Detkov), Bulletin of the
Leningrad State University named after A.S. Pushkin, No. 2, 2011. URL:
http://cyberleninka.ru/article/n/periferiynaya-chast-leksikona-kak-pokazatel-literaturnoy-manerypisatelya-
na-materiale-liricheskih-miniatyur-v-p-detkova]
[14] N. Dittmar, Explorations in ‘Idiolects’, Amsterdam Studies in the Theory and History of Linguistic
Science Series 4 (1996) 109–128.
[15] O. Halvani, Ch. Winter, L. Graner, Assessing the Applicability of Authorship Verification Methods,
in: Proceedings of the 14th International Conference on Availability, Reliability and Security, No. 38,
2019, pp. 1–10. URL: https://doi.org/10.1145/3339252.3340508.
[16] N. Hrytsiv, I. Kulchytskyy, O. Rohach, Quantitative Comparative Analysis in Parallel Translation
Corpus: building author’s and translator’s statistical profiles: (a case study of Lucy Maud Montgomery),
in: Proceedings 2020 IEEE 15th International Conference on Computer Sciences and Information
Technologies (CSIT), 2020, pp. 255–258. doi: 10.1109/CSIT49958.2020.9321893
[17] N. Hrytsiv, T. Shestakevych, J. Shyyka, Quantitative Parameters of Lucy Montgomery's Literary
Style, in CEUR Workshop Proceedings, Vol. 2870, 2021, pp. 670–684.
[18] A. G. Jivani, A Comparative Study of Stemming Algorithms, Int. J. Comp. Tech. Appl., Vol. 2, Issue
6 (2011) 1930–1938.
[19] Yu. O. Kalymon, Strukturno-informatsiina model slovnyka movy novel Vasylia Stefanyka, dys. …
kand. filol. nauk, Lviv, 2020, 312 s. [Yu. O. Kalymon, Structural-informational dictionary model of Vasyl
Stefanyk’s short stories language, Thesis for a Candidate Degree in Philology, Lviv, 2020, 312 p.].
[20] M. Khokhlova, Yssledovanye leksyko-syntaksycheskoy sochetaemosti v russkom yazyike s
pomoshchiyu statystycheskykh metodov (na baze korpusov tekstov), avtoref. dys. na soysk. uch. step. kand.
fylol. nauk, “Prykladnaya i matematycheskaya lynhvystyka”, Sankt-Peterburg (2010) 218 s. [M.
Khokhlova, The study of lexical and syntactic collocability in the Russian language using statistical
methods (based on Text Corpus), Sankt-Peterburg (2010) 218 p.]
[21] I. Khomytska, V. Teslyuk, Authorship and Style Attribution by Statistical Methods of Style
Differentiation on the Phonological Level, volume 871 of Advances in Intelligent Systems and Computing
III, AISC, Springer, 2019, pp. 105-118.
[22] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and
means for the author attribution of a text, volume 3(2-93) of Eastern-European Journal of Enterprise
Technologies, 2018, pp. 41-46.
[23] R. Köhler, G. Altmann, Aims and Methods of Quantitative Linguistics, Problems of Quantitative
Linguistics, Chernivci (2005) 12–42.
[24] A. Kornai, Mathematical Linguistics, London, Springer, XIII, 2008, 289 p.
[25] T. Kosmeda, A. Zahnitko, Zh. Krasnobaieva-Chorna. Delineation of Linguopersonology and
Linguoaxiology, Uniwersytet im. Adama Mickiewicza w Poznaniu, Wydawnictwo Naukowe UAM,
Poznań, 2019.
[26] O. P. Kostetska, Indyvidualne movlennia avtora yak obiekt linhvistyky ta pidkhody do yoho
doslidzhennia, Naukovi zapysky, Natsionalnyi universytet Ostrozka akademiia, Seriia: Filolohichna, № 49
(2014) 196–199. [O. P. Kostetska, The author's individual speech as an object of linguistics and approaches
to its research, Scientific Notes, Ostroh Academy National University, Philological Series, No. 49 (2014)
196–199]
[27] I. M. Kulchytskyi, Unormuvannia tekstu pid chas dokorpusnoho opratsiuvannia: dosvid zastosuvannia.
Visnyk Natsionalnoho universytetu “Lvivska politekhnika”, Seriia: Informatsiini systemy ta merezhi, Vyp.
7 (2020) ss. 51–58. [I. M. Kulchytskyy, Text normalization during pre-corpus preparation: experience of
application, Bulletin of the National University "Lviv Polytechnic", Series: Information systems and
networks, Vol. 7 (2020) pp. 51–58]
[28] I. Kulchytskyi, U. Shandruk, The quantitative research of scientific texts at the symbolic level. In:
Computational linguistics and intelligent systems, vol 2 (2018) 71–80.
[29] K. Lagutina et al., A Survey on Stylometric Text Features, 25th Conference of Open Innovations
Association (FRUCT), 2019, pp. 184-195. doi: 10.23919/FRUCT48121.2019.8981504.
[30] G. Lakoff, The Contemporary Theory of Metaphor. In Metaphor and Thought, Cambridge, Cambridge
University Press (1998) 202–249.
[31] V. V. Levitskiy, Kvantitativnyie metodyi v lingvistike, Nova Kniga, Vinnitsa (2007) 264 s. [V. V.
Levitsky, Quantitative methods in linguistics, Nova Kniga, Vinnitsa (2007) 264 p.]
[32] N. Lototska, Statistical analysis of collocations of the concept joy in R. Ivanychuk’s text corpus,
Scientific Journal of Polonia University, Vol. 37 No 6. (2019) 92–98.
[33] N. Lototska, Statistical Research of the Colour Component ЧОРНИЙ (BLACK) in R. Ivanychuk’s
Text Corpus, in: Proceedings of the 5th International Conference on Computational Linguistics and
Intelligent Systems (COLINS 2021), Vol. I, Lviv, Ukraine, 2021, pp. 486–497.
[34] N. Ya. Lototska, Idiolekt Romana Ivanychuka: korpusnobazovanyi ta linhvokohnityvnyi pidkhody,
Dysertatsiia na zdobuttia naukovoho stupenia doktora filosofii za spetsialnistiu 035 — Filolohiia,
Natsionalnyi universytet «Lvivska politekhnika», Lviv, 2021. [N. Ya. Lototska, The idiolect of Roman
Ivanychuk: corpus-based and linguo-cognitive approaches, Ph.D. thesis, specialty 035 Philology, Lviv
Polytechnic National University, Lviv, 2021]
[35] V. Lytvyn, V. Vysotska, I. Budz, Ya. Pelekh, N. Sokulska, Development of the Quantitative Method
for Automated Text Content Authorship Attribution Based on the Statistical Analysis of N-grams
Distribution, 2019. doi: 10.15587/1729-4061.2019.186834
[36] V. Lytvyn, V. Vysotska, P. Pukach, Z. Nytrebych, I. Demkiv, A. Senyk, O. Malanchuk, S. Sachenko,
R. Kovalchuk, N. Huzyk, Analysis of the developed quantitative method for automatic attribution of
scientific and technical text content written in Ukrainian, volume 6(2-96) of Eastern-European Journal of
Enterprise Technologies, 2018, pp. 19-31. DOI: 10.15587/1729-4061.2018.149596
[37] V. Lytvyn, V. Vysotska, Y. Burov, O. Veres, I. Rishnyak, The Contextual Search Method Based on
Domain Thesaurus, Advances in Intelligent Systems and Computing. (2017) 310–319. doi:
https://doi.org/10.1007/978-3-319-70581-1_22
[38] M. Mahlberg, P. Stockwell, J. Joode, C. Smith, M. O’Donnell, CLiC Dickens: novel uses of
concordances for the integration of corpus stylistics and cognitive poetics. URL:
https://research.birmingham.ac.uk/portal/files/38225413/cor_2E2016_2E0102.pdf
[39] S. Mollin, “I entirely understand” is a Blairism: The methodology of identifying idiolectal collocations,
International Journal of Corpus Linguistics, 14(3) (2009) 367–392. DOI:
https://doi.org/10.1075/ijcl.14.3.04mol
[40] S. T. Mubin, S. P. Rajesh, Authorship Identification with Multi Sequence Word Selection Method, in:
Thermal Stresses—Advanced Theory and Applications, 2019, pp. 653–661.
[41] M. Yu. Muhin, Kontseptualnyie profili proizvedeniy M. Bulgakova, V. Nabokova, A. Platonova i M.
Sholohova (po dannyim sopostavitelnogo analiza chastotnoy leksiki), Vestnik BFU im. I. Kanta, № 8, 2010.
URL:         http://cyberleninka.ru/article/n/kontseptualnye-profili-proizvedeniy-m-bulgakova-v-nabokova-
aplatonova-i-m-sholohova-po-dannym-sopostavitelnogo-analiza-chastotnoy [M. Yu. Mukhin, Conceptual
profiles of texts by M. Bulgakov, V. Nabokov, A. Platonov and M. Sholokhov (according to the
comparative analysis of frequency vocabulary), Bulletin of the BFU named after I. Kanta, No. 8, 2010.
URL:         http://cyberleninka.ru/article/n/kontseptualnye-profili-proizvedeniy-m-bulgakova-v-nabokova-
aplatonova-i-m-sholohova-po-dannym-sopostavitelnogo-analiza-chastotnoy]
[42] G. V. Napreenko, Internet-dnevniki i problema identifikatsii lichnosti, Yurislingvistika 11, Pravo kak
diskurs, tekst i slovo, pod red. N. D. Goleva, K. I. Brineva, Kemerovo, Izd-vo Kemerovskogo
gosudarstvennogo universiteta (2011) 480–492. [G. V. Napreenko, Internet diaries and the problem of
personal identification, Jurislinguistics 11, Law as discourse, text and word, ed. N. D. Goleva, K. I. Brinev,
Kemerovo, Publishing House of Kemerovo State University (2011) 480–492]
[43] O. Naum, L. Chyrun, V. Vysotska, O. Kanishcheva, Intellectual system design for content formation,
in: 12th International Scientific and Technical Conference on Computer Sciences and Information
Technologies (CSIT), 2017. doi: https://doi.org/10.1109/stc-csit.2017.8098753
[44] Yu. N. Pavlov, E. A. Tihomirova, Otsenka ustoychivosti vo vremeni chastotnyih slovarey avtorov v
zadachah identifikatsii tekstov, Nauka i obrazovanie, № 12, 2011. URL: http://cyberleninka.ru/article/n/77-
30569-274006-otsenka-ustoychivosti-vo-vremenichastotnyh-slovarey-avtorov-v-zadachah-identifikatsii-
tekstov [Yu. N. Pavlov, E. A. Tikhomirova, Time stability estimation of authors' frequency dictionaries in
text identification problems, Science and Education, No. 12, 2011 URL: http://cyberleninka.ru/article/n/77-
30569-274006-otsenka-ustoychivosti-vo-vremenichastotnyh-slovarey-avtorov-v-zadachah-identifikatsii-
tekstov]
[45] O. O. Pavlychko, Shchodo statystychnykh parametriv avtorskoho styliu (na materiali tvoriv E.M.
Remarka), Movni i kontseptualni kartyny svitu, VPTs «Kyivskyi un-t», Kyiv, Vyp. 29 (2010) 186–191. [O.
O. Pavlychko, Regarding the statistical parameters of the author's style (based on the texts of E.M.Remark),
Linguistic and conceptual worldview, PPC Kyiv University, Kyiv, Vol. 29 (2010) 186–191]
[46] V. S. Perebyinis, M. P. Muravytska, N. P. Darchuk, Chastotni slovnyky ta yikh vykorystannia, Kyiv,
1985, 204 s. [V. S. Perebyinis, M. P. Muravytska, N. P. Darchuk, Frequency dictionaries and their use,
Kyiv, 1985, 204 p.]
[47] V. I. Perebyinis, Shcho daie statystyka movoznavtsiam?, Visnyk Kyivskoho linhvistychnoho
universytetu, Seriia Filolohiia, Kyiv, Vyd. tsentr KNLU, T. 6, № 2. (2003) 27–32. [V. I. Perebyynis, What
does statistics give to linguists?, Bulletin of Kyiv Linguistic University, Philology Series, Kyiv, Publishing
House KNLU, Vol. 6, No. 2 (2003) 27–32]
[48] O. Perelomova, Idiostyl Valeriia Shevchuka, dys. … kand. filol. nauk, 10.02.01, Sumy, 2002. 177 s.
[O. Perelomova, Valery Shevchuk’s Idiostyle, Ph.D. thesis, 10.02.01, Sumy, 2002. 177 p.]
[49] N. Pylypiuk, O. Ilnytzkyj, S. Kozakov, Online Concordance to the Complete Works of Hryhorii
Skovoroda, 2013. URL: http://www.arts.ualberta.ca/~ukr/skovoroda/NEW/index.php?glang
[50] S. Raj, B. Kannan, and V. P. Jagathy Raj, Significance of Network Properties of Function Words in
Author Attribution, Intelligent Data Engineering and Analytics, Springer, Singapore, 202, pp. 171-181.
https://doi.org/10.1007/978-981-15-5679-1_17
[51] A. Rovenchak, S. Buk, Part-of-speech sequences in literary text: Evidence from Ukrainian, Journal
of Quantitative Linguistics, Vol. 25, No. 1, (2018) 1–21. doi:
https://doi.org/10.1080/09296174.2017.1324601
[52] M. Ruszkowski, Statystyka w badaniach stylistyczno-składniowych, Kielce, Wydawnictwo
Świętorszyskiej, 2004, 144 s.
[53] O. Seminck, Ph. Gambette, D. Legallois, T. Poibeau, The Corpus for Idiolectal Research (CIDRE),
Journal of Open Humanities Data, Ubiquity Press, 7, 2021, pp. 15.
[54] M. Shvedova, The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org):
Architecture and Functionality, in: Proceedings of the 4th International Conference on Computational
Linguistics and Intelligent Systems, COLINS 2020, Vol. I, Lviv, Ukraine (2020) pp. 489–506.
[55] M. Shvedova, R. von Waldenfels, S. Yarygin, A. Rysin, V. Starko, M. Woźniak, M. Kruk et al. GRAC:
General Regionally Annotated Corpus of Ukrainian, 2017–2021. URL: http://uacorpus.org/
[56] J. Sinclair, Corpus, Concordance, Collocation, Oxford, Oxford University Press, 1991, 200 p.
[57] H. Sytar, Osoblyvosti realizatsii frazeolohizovanykh rechen u tvorakh Bohdana Lepkoho,
Linhvistychni studii, Vyp. 40 (1) (2020) 64–80. URL: http://nbuv.gov.ua/UJRN/lingst_2020_40(1)__7 [H.
Sytar, Peculiarities of realization of phraseologized sentences in Bohdan Lepky’s novels, Linguistic
Studies, Vol. 40 (1) (2020) 64–80. URL: http://nbuv.gov.ua/UJRN/lingst_2020_40(1)__7]
[58] A. Stefanowitsch, Corpus linguistics: A guide to the methodology, Textbooks in Language Sciences
7, Berlin, Language Science Press, 2020.
[59] M. Stubbs, Quantitative Methods in Literary Linguistics, Cambridge, Cambridge University Press
(2014) 46-62.
[60] H. Sytar, Syntaksychni frazeolohizmy v linhvopersonolohiinomu portreti Mykhaila Kotsiubynskoho,
Teoriia linhvistychnykh paradyhm: kolektyvna monohrafiia na poshanu profesora, chlen-korespondenta
NAN Ukrainy Anatoliia Zahnitka, za red. Zh. Krasnobaievoi-Chornoi, Vinnytsia: TOV «Nilan-LTD»,
2019, ss. 172–195. [H. Sytar, Syntactic Phraseologisms in the Linguo-Personological Portrait of Mykhailo
Kotsyubynsky, Theory of Linguistic Paradigms: A Collective Monograph in Honor of Professor Anatoliy
Zagnitko, Corresponding Member of the National Academy of Sciences of Ukraine, ed. J. Krasnobayeva-
Chorna, Vinnytsia, Nilan-LTD LLC, 2019, pp. 172–195]
[61] H. V. Sytar, Syntaksychni frazeolohizmy v linhvopersonolohiinomu portreti Yuriia Shevelova (na
materiali korpusu tekstiv Yuriia Shevelova), Linhvistychni studii, Vyp. 37 (2019) 130–134. [H. V. Sytar,
Syntactic phraseology in Yuri Shevelyov’s linguo-personological portrait (on the material of Yuri
Shevelyov’s corpus texts), Linguistic Studies, Vol. 37 (2019) 130–134]
[62] Yu. A. Tuldava, Problemyi i metodyi kvantitativno-sistemnogo issledovaniya leksiki. Tallin, Valgus,
1987, 204 s. [Yu. A. Tuldava, Problems and methods of quantitative-systemic research of vocabulary,
Tallinn, Valgus, 1987, 204 p.]
[63] A. I. Vehesh, Tradytsii ta novatorstvo ukrainskoi literaturno-khudozhnoi antroponimii posttotalitarnoi
doby, dys. … kand. filol. nauk, 10.02.01, Ivano-Frankivsk, 2010. 273 s. [A.I. Vehesh, Traditions and
innovations of Ukrainian literary and artistic anthroponymy of the post-totalitarian era, Ph.D. thesis,
10.02.01, Ivano-Frankivsk, 2010. 273 p.]
[64] V. I. Voloshuk, Linhvostylistychni osoblyvosti idiolektu Z. Lentsa v malykh epichnykh zhanrakh:
avtoref. dys. … kand. filol. nauk, 10.02.04, Lviv, 2004. 20 s. [V.I. Voloshuk, Linguistic and stylistic
features of Lenz's idiolect in small epic genres, Ph.D. thesis, 10.02.04, Lviv, 2004. 20 p.]
[65] V. Vysotska, V. Lytvyn, V. Kovalchuk, S. Kubinska, M. Dilai, B. Rusyn, L. Pohreliuk, L. Chyrun, S.
Chyrun, O. Brodyak, Method of similar textual content selection based on thematic information retrieval,
in: CSIT, Proceedings of the XIVth Scientific and Technical Conference, Lviv, 2019 pp. 1–6.
[66] W. Wimmer, G. Altmann, Review Article: On vocabulary richness, Journal of Quantitative
Linguistics, Vol. 6, No. 1 (1999) 1–9.
[67] A. P. Zahnitko, Teoriia hramatyky i tekstu, Donetsk, 2014, 480 s. [A. P. Zahnitko, Theory and
Grammar of the Text, Donetsk, 2014, 480 p.]
[68] O. Zuban, Lexicographical Database of Frequency Dictionaries of Morphemes Developed on the Basis
of the Corpus of Ukrainian Language, in: Advances in Intelligent Systems and Computing IV, CSIT 2019,
vol. 1080, Springer, Cham, 2020, doi: 10.1007/978-3-030-33695-0_37.