<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>T. Shestakevych);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Quantitative characteristics of the author's idiostyle</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetiana Shestakevych</string-name>
          <email>Tetiana.v.shestakevych@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuliia Shyika</string-name>
          <email>yuliia.i.shyika@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Larysa Tsiokh</string-name>
          <email>larysa.y.tsokh@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLW-2024: Computational Linguistics Workshop at 8th International Conference on Computational Linguistics and Intelligent Systems</institution>
          ,
          <addr-line>CoLInS-2024</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>28a, Stepana Bandery St., Building 5, Room 407, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>The article explores the field of corpus linguistics and the significance of text corpora in linguistic research. It examines the various classifications of linguistic corpora based on factors such as linguistic data type, parallelism, literature, and purpose of creation. Furthermore, it highlights the parameters and criteria for creating high-quality linguistic corpora, including sufficiency, consistency, reproducibility, correctness, and technological effectiveness. The article presents a case study on the corpus of M. Yatskiv's works, discussing its typological and applicative characteristics. Finally, it provides quantitative characteristics and a linguistic statistical analysis of the research corpus, offering insights into vocabulary volume, word forms, vocabulary richness, and word repetition. Overall, the article highlights the value of text corpora in linguistic research and provides practical examples for analysis of the author's idiostyle.</p>
      </abstract>
      <kwd-group>
        <kwd>Corpus linguistics</kwd>
        <kwd>text corpora</kwd>
        <kwd>software tools</kwd>
        <kwd>statistical analysis</kwd>
        <kwd>idiostyle</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The fundamental features of a text corpus are its machine readability, which requires
electronic forms and specific data coding systems, and its representativeness. There are
various definitions of a corpus that emphasize the importance of these two features, such
as "a collection of machine-readable texts that fully represents the language and its
diversity"; "a large number of natural language texts in digital form used for linguistic
research", where "natural" means everything that has been expressed in oral or written
form; and "written and spoken texts which are in one way or another representative for a
language and are presented as an electronic database". To these criteria, N. Dash and B.
Chaudhuri [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] add the parameter of corpus applicability in linguistic research, defining a
corpus as "a collection of linguistic data composed of either written texts or transcribed
spoken texts, the main purpose of which is to test hypotheses about language".
      </p>
      <p>
        Within the field of corpus linguistics, various theoretical definitions exist regarding the
nature of a corpus. However, J. Sinclair provides a brief and functional definition of an
electronic text corpus. According to J. Sinclair, a corpus refers to a collection of carefully
selected and appropriately ordered text passages or fragments that serve as a
representative sample of language [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In brief, a text corpus refers to an electronic compilation of written and spoken texts in
any natural language that is methodically structured to meet certain mandatory
requirements and intended to facilitate scientific research on language. The nature and
scope of text corpora may vary significantly, depending on factors such as their intended
use, structure, selection principles, volume, and presentation format, among others. As
such, text corpora are distinguishable from databases and other similar resources, and
may differ widely from one another.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>In contemporary linguistics, corpus-based studies have become a significant area of
research, with numerous monographs, scientific articles, and textbooks on corpus
technologies in foreign and domestic linguistics. Corpus-based studies hold an important
place in world linguistics. Leading figures in corpus linguistics, such as G. Leech, D. Biber, J.
M. Sinclair, S. Th. Gries, S. Granger, T. McEnery, P. Baker, P. W. Hanks, among others, have
made groundbreaking contributions to the development and application of corpus-based
methodologies. In the Ukrainian context, several researchers have significantly
contributed to corpus-based studies, enriching the field with insights specific to the
Ukrainian language and its usage. Notable Ukrainian scholars in corpus linguistics include
S. Buk, N. Darchuk, O. Demska, V. Zhukovska, A. Zahnitko, I. Danyliuk, H. Sytar, V. Shyrokov,
I. Kulchytskyi, and others.</p>
      <p>The key topics of interest in this field fall into several areas: analytical reviews of
discussions and foreign publications on the place of corpus technologies and corpus
linguistics in modern linguistics; overviews of corpus linguistics and the history of its
formation; and discussions of what a corpus of texts involves, its defining features,
approaches to the classification of corpora, and the branches and methods of their use.
The literature also covers the concept of the national text corpus, its prerequisites, and
the principles of its planning and compilation; the Ukrainian National Linguistic Corpus of
more than 100 million word uses, created in the Ukrainian National Linguistic Fund of the
National Academy of Sciences; the basic principles and perspectives of the research corpus
of the Ukrainian language; some aspects of the creation and use of specific research
corpora; and the technical aspects of preparing texts for further corpus research.</p>
      <p>
        The scholars mentioned above have proposed classifications of linguistic corpora
based on several factors. These include the type of linguistic data (written, oral, or mixed),
parallelism (monolingual, bilingual, or multilingual), literature (literary, dialectal,
colloquial, terminological, or mixed), the purpose of creation (multipurpose and
specialized), genre (fictional, folklore, dramatic, or journalistic), availability (free,
commercial, or closed), purpose (research and illustrative), dynamism (dynamic and
static), tagging (tagged and not tagged), type of tagging (morphological, semantic,
syntactic, or others), and volume of text (full-text or fragmented). However, some
researchers suggest simplifying the classification system and recognizing the following
categories: specialized, reference, multilingual, parallel, educational, diachronic, and
monitoring [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It is important to note that these classifications aid in organizing and
categorizing linguistic corpora for various purposes, such as research and analysis.
      </p>
      <p>
        O. Demska-Kulchytska argues that corpora can be identified as follows [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <list list-type="bullet">
        <list-item><p>Full-text (full texts are included in the corpus)</p></list-item>
        <list-item><p>Fragmentary (only text fragments are included)</p></list-item>
        <list-item><p>Exploratory (used in linguistic research to formulate new theories and concepts)</p></list-item>
        <list-item><p>Illustrative (used to confirm existing theories or hypotheses about language)</p></list-item>
        <list-item><p>Monitoring/dynamic (provide the possibility of observing changes in the language, taking into account the aspect of diachrony)</p></list-item>
        <list-item><p>Static (show the state of the language at a particular time period)</p></list-item>
        <list-item><p>Diachronic (represent the language across several time periods)</p></list-item>
        <list-item><p>Synchronic (represent the language or text of a certain defined period of time)</p></list-item>
        <list-item><p>General (represent the national language)</p></list-item>
        <list-item><p>Specialized (aimed at solving specific research tasks)</p></list-item>
      </list>
      <p>
        Within the field of corpus linguistics, there exists a distinct category of multilingual
linguistic corpora, parallel corpora, and comparative linguistic corpora, which hold
significant value for scholars engaged in translation studies. These resources allow the
effective analysis and comparison of language across diverse contexts and languages,
enabling researchers to gain a deeper understanding of linguistics in practice [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>The representativeness of a corpus is a significant feature: it means the ability of
the corpus to reflect all the properties of the subject field, by which we understand the
implementation level of the linguistic system, comprising the linguistic phenomena subject
to description. Authenticity involves the selection of written or spoken texts, or excerpts
of texts, created by native speakers in the process of real communication. This
criterion is an essential component of the empirical approach, ensuring that the material is
genuine. Additionally, selectivity is necessary to limit the material by selecting specific
speech fragments, while balance involves introducing a proportional number of textual
resources into the corpus.</p>
      <p>
        Linguistic corpora are characterized by four main parameters. Firstly, the corpus size
should be significant enough to be representative of the subject field. Secondly, it should
be structured and tagged for efficient use. Thirdly, the texts included in the corpus must be
digitized for ease of access and analysis. Fourthly, the concept of “electronic corpus”
includes special software for working with this corpus. These parameters are essential to
ensure the corpus' quality and effectiveness in linguistic analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        V. Shyrokov outlines the selection criteria applied during the creation of the Ukrainian
National Corpus. These included the diachronic aspect, which determined the selection of
texts across time periods, as well as the stylistic aspect, which aimed to represent the
substyles of the national language. Additionally, the territorial aspect was considered,
taking into account the specificity of the literary language in different regions of Ukraine
and the fact that the Ukrainian language can be used to create literary oral or written texts
outside of Ukraine. Finally, the quantitative aspect was also taken into account, clearly
defining the number of words in each text or passage included in the corpus, as well as the
number of texts and/or passages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and materials</title>
      <p>In corpus linguistics, users interact with the corpus through specialized software tools or
corpus managers, which offer diverse means to extract the necessary information from the
corpus. These tools enable users to conduct various types of searches, such as searching
for specific word forms, discontinuous or continuous syntagms, or word forms based on
morphological features. Additionally, users can access information on the origin or type of
text, as well as obtain lexical and grammatical statistical data. Users may also save selected
concordance lines in a separate file on their computer.</p>
      <p>
        However, the corpus alone is insufficient for accomplishing many of the
aforementioned tasks. The text must also carry diverse linguistic information.
This led to the development of a tagged corpus, which facilitates the acquisition of more
interesting results at the statistical level. Tagging makes it possible to count not only the
frequency of words, but also the frequency of different parts of speech [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        The task of corpus annotation centers around the markup format. A linguistic corpus
that possesses at least one linguistic parameter markup is distinguishable from other
linguistic information and instrumental systems or databases. As such, specific
requirements are placed upon the technique and technology of tagging. Ideally, the
marking of corpora should occur in a unified and coordinated manner with previously
established systems of tagging electronic arrays of information, allowing for a
linguistically meaningful interpretation of introduced markers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Structural annotation involves selecting structural elements of the text using a
particular markup language and set of markers that indicate the external elements of text
structure. To implement structural annotation, several procedures must occur:</p>
      <list list-type="order">
        <list-item><p>Text segmentation</p></list-item>
        <list-item><p>Formalization of annotation parameters of target units</p></list-item>
        <list-item><p>Creation of a tagset or set of formal codes</p></list-item>
        <list-item><p>Determination of the annotation scheme and its principles</p></list-item>
      </list>
      <p>
        Linguistic experts have identified several key criteria for standard corpora, including
sufficiency, consistency, reproducibility, correctness, possibility of data collection,
technological effectiveness, scalability, compactness, and clarity. Sufficiency refers to the
need for a wide range of structural elements that can meet most requirements. Consistency
means that the markup scheme is based on consistent rules that enable precise identification
of tags and attributes. Reproducibility means that the coding scheme is based on
clearly defined rules that allow the original text to be reproduced using simple
algorithms. Correctness is maintained through software that checks the conformity of
markup with structural specifications. Data collection encompasses direct data collection
through manual input or automatic text recognition, as well as data coding. Technological
effectiveness is necessary to meet the needs associated with automatic processing of texts,
including the selection of text according to established criteria, the use of particular
mechanisms, and the type of intertext indexes. Scalability ensures that any created scheme
can be expanded. Compactness matters because markup can affect the file size and the speed
of text data processing; methods of achieving compactness include tag minimization (for
example, omitting or shortening the end tag), the use of specific end tags, or XML markup
schemes. Finally, clarity is crucial when users must work with the text directly, without
special software support, which requires transparent markup [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13-15</xref>
        ].
      </p>
      <p>Therefore, the following marking system has been used: &lt;p&gt; – the beginning of a
paragraph; &lt;/p&gt; – the end of the paragraph; &lt;s&gt; – the beginning of a sentence; &lt;/s&gt; – the
end of the sentence; &lt;head&gt;…&lt;/head&gt; – heading.</p>
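      <p>The marking system above lends itself to simple automatic counts. The following is a minimal Python sketch of such a count, not the authors' own program: the sample string and the function name structural_counts are hypothetical, and words are approximated with a regular expression, which splits word forms written with an apostrophe.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical sample in the markup described above (not taken from the corpus).
SAMPLE = (
    "<head>Title</head>"
    "<p><s>First sentence here.</s><s>Second one.</s></p>"
    "<p><s>Another paragraph with one sentence.</s></p>"
)


def structural_counts(text):
    """Count headings, paragraphs and sentences, and the mean sentence length in words."""
    heads = re.findall(r"<head>(.*?)</head>", text, flags=re.S)
    paragraphs = re.findall(r"<p>(.*?)</p>", text, flags=re.S)
    sentences = re.findall(r"<s>(.*?)</s>", text, flags=re.S)
    # Approximate word tokens inside each sentence with runs of word characters.
    words_per_sentence = [len(re.findall(r"\w+", s)) for s in sentences]
    mean_length = sum(words_per_sentence) / len(sentences) if sentences else 0.0
    return {
        "heads": len(heads),
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "mean_sentence_length": mean_length,
    }


counts = structural_counts(SAMPLE)
print(counts)
```
]]></preformat>
      <p>Because the tags are never nested within themselves in this scheme, non-greedy matching is sufficient here; a corpus with nested or attribute-bearing tags would call for a proper XML parser instead.</p>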
      <p>In analyzing an idiostyle, it is important to consider the structural elements of the text,
such as the title, paragraph, and sentence. The annotated corpus of M. Yatskiv's prose
is a useful tool for studying the author's idiostyle, allowing for both qualitative and
quantitative analysis of his language and providing insight into the characteristics and
nuances of his writing.</p>
      <p>In terms of typological and applicative characteristics, the corpus of M. Yatskiv's prose
can be classified as:</p>
      <list list-type="bullet">
        <list-item><p>illustrative: it has been compiled for the purpose of linguistic-statistical analysis of the writer's idiolect;</p></list-item>
        <list-item><p>full-text: it contains the complete text of the story "In a Clutch (Shadow Dance)" as well as 42 short stories;</p></list-item>
        <list-item><p>static: it does not allow for the ongoing addition of texts;</p></list-item>
        <list-item><p>author's language: it includes only texts by M. Yatskiv;</p></list-item>
        <list-item><p>monolingual: it includes texts only in Ukrainian;</p></list-item>
        <list-item><p>written: the corpus is a collection of written texts;</p></list-item>
        <list-item><p>annotated: textual data are tagged at the syntactic level.</p></list-item>
      </list>
      <p>The following software has been used to establish the quantitative and qualitative
characteristics of the corpus:</p>
      <list list-type="bullet">
        <list-item><p>Textanz, particularly its Wordforms option, which enables the determination of not
only the frequency of each word form and its length but also the variance
(deviation of the values of a random variable from the center of distribution).
Additionally, the Summary option allows for determining the text's word count,
number of sentences, average sentence length, average paragraph length, average
word length, number of unique word forms, lexical diversity, lexical density,
longest/shortest sentence, longest/shortest word form, and coefficient of
readability.</p></list-item>
        <list-item><p>The AntConc toolkit and its Word List option, which counts all the words in the corpus
and presents them in an ordered list.</p></list-item>
        <list-item><p>Programs coded in Python by the authors to work with the corpus, for example, to
convert a list of word forms with structural marks into a list of word forms in txt
format and Excel tables, to preprocess text arrays before corpus analysis, to
calculate the distribution of lengths of different linguistic units (words in letters,
sentences in words, etc.), and to compute statistics on the distribution of word
forms and words of the text by parts of speech, among other tasks.</p></list-item>
      </list>
      <p>The formal model of the process of information technologies application in this
research is presented as a Petri net (Fig. 1).</p>
      <p>
        A Petri net is a mathematical abstraction widely used for process modelling, as it
conveniently visualizes simultaneous and sequential tasks within a modelled
process [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. In the Petri net N = (I, O, P, T, W), each position holds the result of a process,
and the processes are represented by transitions. Every transition, defined in Table 1,
corresponds to the application of a relevant information technology, for example, an Optical
Character Recognition application, Excel, Python software, or the text analysis software
AntConc. The initial marking of the Petri net is M = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0). To reach
the research goal, the transitions should fire sequentially,
t1→ t2→ t3→ t4→ t5→ t6→ t7→ t8→ t9 (see Tables 1 and 2).
      </p>
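      <p>The sequential firing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the net is taken to be a linear chain in which transition t(i) consumes one token from position p(i) and produces one token in position p(i+1), with all arc weights equal to 1. The actual structure is given by Fig. 1 and Tables 1 and 2, and the function name fire_chain is hypothetical.</p>
      <preformat><![CDATA[
```python
def fire_chain(initial_marking):
    """Fire t1, t2, ... in order on an assumed linear chain of positions.

    Transition t(i) is enabled when position p(i) holds a token; firing it
    moves one token from p(i) to p(i+1).
    """
    marking = list(initial_marking)
    fired = []
    for i in range(len(marking) - 1):
        if marking[i] == 0:
            break  # t(i+1) is not enabled, so the run stops here
        marking[i] -= 1
        marking[i + 1] += 1
        fired.append(f"t{i + 1}")
    return fired, marking


# Marking with a single token in the first of ten positions, as in the text.
fired, final_marking = fire_chain([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(" -> ".join(fired))
print(final_marking)
```
]]></preformat>
      <p>Under these assumptions all nine transitions fire and the single token ends in the last position, which corresponds to the completed research workflow.</p>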
      <p>The purpose of creating a corpus of short stories based on the language of M. Yatskiv is
to offer empirical data for scientific research on the author's idiostyle. The transition from
traditional research methods to corpus-based ones is an evolutionary step that
necessitates the existence of electronic textual data. This data enables the creation of a
foundation for producing a dictionary of M. Yatskiv's language.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>By meeting all the requirements for creating a linguistic corpus, we have obtained a
convenient tool that can be used for work of any complexity, especially for solving the
specific objectives of our research.</p>
      <p>The story "In a Clutch" and 42 short stories by M. Yatskiv have been selected for the
study. General quantitative characteristics of the corpus are presented in Table 3.</p>
      <p>Texts for the research corpus were first digitized, then normalized and tagged. The
AntConc program was then used to generate a list of word forms from the story "In a
Clutch." This list was subsequently transferred to MS Excel for lemmatization, reducing
word forms to their dictionary form. The total number of unique words in the story was
then calculated by sorting word forms using MS Excel's "Sorting and filtering" function
based on the "Part of speech + Lemma" criterion. The "Interim results" function in MS
Excel was then used to generate a list of subcorpus words and their frequency of use. The
same function was also employed to calculate the number of word uses, word forms, and
words by parts of speech in the subcorpus of M. Yatskiv's works. Thus, as a result of the
analysis of both subcorpora, we obtained the following results presented in Table 4.</p>
      <table-wrap>
        <caption>
          <title>General characteristics of the research corpus</title>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Subcorpus</th>
              <th>Length of preliminary paragraphs</th>
              <th>… of preliminary …</th>
              <th>Control number of symbols</th>
              <th>Word uses</th>
              <th>Word forms</th>
              <th>Words</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>“In a Clutch”</td>
              <td>108</td>
              <td>2286</td>
              <td>388201</td>
              <td>72599</td>
              <td>17568</td>
              <td>10058</td>
            </tr>
            <tr>
              <td>Short stories</td>
              <td>108</td>
              <td>2649</td>
              <td>219590</td>
              <td>42963</td>
              <td>10535</td>
              <td>6202</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        A linguistic statistical analysis of the research corpus of M. Yatskiv’s works has been
conducted, following the established methodology developed by S. N. Buk, Kulchytskyi and
Tsiokh, and others [
        <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21 ref22">18, 19, 20, 21, 22</xref>
        ]. The main characteristics of the text are as follows:
      </p>
      <p>The volume of the text (N), that is, the total number of word uses, equals 72599 in
the corpus under research.</p>
      <p>The number of word forms in the text (Vf), that is, the number of distinct word forms, equals 17658.</p>
      <p>The vocabulary volume (V), that is, the number of distinct words (lexemes) used in the
text, equals 10058 in our research corpus.</p>
      <p>The vocabulary richness or diversity index (Id) is the ratio of the vocabulary volume
(V) to the overall text volume (N), and is calculated using the formula:
        <disp-formula><tex-math>I_d = \frac{V}{N} = \frac{10058}{72599} \approx 0.14</tex-math></disp-formula>
      </p>
      <p>
        A higher index indicates a greater variety of words used. In this case, the index of 0.14
is considered high, as the average index of fiction, as calculated by S.N. Buk [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], is 0.067.
      </p>
      <p>The average repetition of a word in the text (Iwr) is the inverse of the diversity index
and is calculated as:
        <disp-formula><tex-math>I_{wr} = \frac{N}{V} = \frac{72599}{10058} \approx 7.22,</tex-math></disp-formula>
where N is the overall text volume and V is the vocabulary volume.</p>
      <p>On average, each word is used approximately seven times in the text.</p>
      <p>Hapax legomena (V1) are words that appear only once in a given sample, that is, with a
frequency of 1. Our research corpus contains a total of 5495 such words.</p>
      <p>The exclusivity index measures the variability of the vocabulary,
specifically the portion of the text that comprises words that appear only once. The
exclusivity index for the vocabulary (Ien) is calculated as the ratio of the number of
lexemes with a frequency of 1 (V1) to the volume of the text (N), resulting in:
        <disp-formula><tex-math>I_{en} = \frac{V_1}{N} = \frac{5495}{72599} \approx 0.08</tex-math></disp-formula>
      </p>
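      <p>The diversity, repetition, and exclusivity indices are plain ratios of three counts, so the reported values can be rechecked in a few lines. The Python sketch below is illustrative rather than the authors' code: the function names and the toy lemma list are hypothetical, while the counts N = 72599, V = 10058, and V1 = 5495 are those reported in the text.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def counts_from_lemmas(lemmas):
    """Derive N (word uses), V (distinct lemmas) and V1 (hapax legomena) from a lemma list."""
    freq = Counter(lemmas)
    n_uses = sum(freq.values())
    n_lemmas = len(freq)
    n_hapax = sum(1 for f in freq.values() if f == 1)
    return n_uses, n_lemmas, n_hapax


def lexical_indices(n_uses, n_lemmas, n_hapax):
    """Return (Id, Iwr, Ien) as defined in the text."""
    diversity = n_lemmas / n_uses     # Id  = V / N
    repetition = n_uses / n_lemmas    # Iwr = N / V
    exclusivity = n_hapax / n_uses    # Ien = V1 / N
    return diversity, repetition, exclusivity


# Counts reported for the research corpus.
i_d, i_wr, i_en = lexical_indices(72599, 10058, 5495)
print(round(i_d, 2), round(i_wr, 2), round(i_en, 2))  # prints: 0.14 7.22 0.08

# Hypothetical toy input showing how the counts themselves are obtained.
toy_counts = counts_from_lemmas(["a", "b", "a", "c", "a", "b"])
print(toy_counts)  # prints: (6, 3, 1)
```
]]></preformat>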
      <p>
        According to the functional styles of the Ukrainian language [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], the exclusivity index
for fiction is 0.029.
      </p>
      <p>The vocabulary concentration index represents the portion of the vocabulary that consists
of words that appear ten or more times. The vocabulary concentration index (Ivc) is
determined as the ratio of the number of words with an absolute frequency of
10 or more (V10) to the vocabulary volume (V), giving us:
        <disp-formula><tex-math>I_{vc} = \frac{V_{10}}{V} = \frac{949}{10058} \approx 0.09</tex-math></disp-formula>
      </p>
      <p>The average vocabulary concentration index for fiction is 0.14.</p>
      <p>A low concentration index and a high number of words with a frequency of 1 (and,
consequently, a high exclusivity index) are indicative of a significant diversity in the
author's vocabulary.</p>
      <p>The lexical density index, which is closely associated with the vocabulary concentration
index, is a measure that expresses the ratio of content words (Ncw) in the text to the total
number of words (N). Texts that have fewer function words tend to be more lexically
dense. It is possible to calculate coefficients of lexical density for content words and
separately for nouns, adjectives, verbs, and adverbs:
        <disp-formula><tex-math>I_{den} = \frac{N_{cw}}{N} = \frac{17143}{42963} \approx 0.4</tex-math></disp-formula>
      </p>
      <p>The Automated Readability Index (ARI) was first developed in 1967 with the intention
of evaluating the readability of technical manuals and various documents. Over time, its
application has expanded to other areas. Unlike other well-known readability indices, such
as the Flesch-Kincaid, Gunning Fog Index, SMOG Index, and the Fry Readability Formula,
the ARI possesses a unique advantage, along with the Coleman-Liau index, in that it does
not rely on a specific natural language of the printed text. This is because it does
not take into account syllables, but rather the number of characters per word and the
number of words per sentence. The ARI formula is:
        <disp-formula><tex-math>ARI = 4.71 \cdot \frac{C}{W} + 0.5 \cdot \frac{W}{S} - 21.43,</tex-math></disp-formula>
where C represents the number of characters in the text, W represents the number of
words in the text, and S represents the number of sentences in the text. It is important to
note that the higher the ARI index, the more challenging it is to comprehend the text.</p>
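      <p>As an illustration, the ARI can be computed directly from raw text. The Python sketch below assumes the standard published coefficients of the index (4.71, 0.5, and 21.43) and uses deliberately simple tokenization: words are runs of word characters and sentences end at a full stop, exclamation mark, or question mark. The function name and the sample sentence are hypothetical, and the text is assumed to be non-empty.</p>
      <preformat><![CDATA[
```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * C / W + 0.5 * W / S - 21.43 for a non-empty text.

    C = characters inside words, W = words, S = sentences.
    Tokenization is a simplification: apostrophes and hyphens split words.
    """
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = sum(len(w) for w in words)
    n_words = len(words)
    n_sentences = len(sentences)
    return 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43


sample = "The cat sat. The dog ran fast."  # 22 characters, 7 words, 2 sentences
ari = automated_readability_index(sample)
print(round(ari, 2))
```
]]></preformat>
      <p>Very short and simple samples such as this one yield a negative score; the index becomes informative on longer texts, where it rises with longer words and longer sentences.</p>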
      <p>
        Additionally, the ratio of parts of speech in a given text can serve as a statistical
parameter of an individual author's style and a characteristic feature of a particular work
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The research corpus has been subjected to morphological tagging, using the classical
division of words into parts of speech, and the frequency of each part of speech in the text
has been automatically obtained, as shown in Table 5.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
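      <p>Word use by part of speech in "In a Clutch" (values as listed in the source table, in the original order): 176; 13603; 8186; 20218; 8066; 6073; 4863; 5355; 4839; 1220; total: 72599.</p>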
      <p>The analysis of M. Yatskiv's works reveals that verbs and nouns are the most frequently
used parts of speech, accounting for 18.74% and 21.48% respectively in both the novel
and the writer's short stories. Function words, on the other hand, show the highest level of
activity (25.39% and 34.14% respectively). Pronouns are also used in significant amounts,
amounting to 11.28% and 11.5%. Adjectives and adverbs are almost equal in M. Yatskiv's
short stories, with 8.44% and 8% respectively, while adjectives are more prevalent in the
novel, at 11.11% against 6.70% for adverbs. Numerals are the least used in both texts
(1.68% and 0.75%).</p>
      <p>
        In linguistic statistics, it is common to calculate the quantitative relations between
parts of speech, considering them as one of the components of the statistical
characteristics of the text. These relations include the index of nominal modifiers (Inat),
which measures the ratio of the sum of noun uses (Vn) to the sum of adjective uses (Vadj),
and the index of verbal modifiers (Ivat), which measures the ratio of the sum of adverb
uses (Vadv) to the sum of verb uses (Vv) [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Additionally, the degree of nominality
(Inom) is also considered, measuring the ratio of the sum of noun uses (Vn) to the sum of
verb uses (Vv).
      </p>
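      <p>The three part-of-speech ratios follow directly from the word-use counts. The Python sketch below implements them exactly as they are defined in the preceding paragraph; the function name, the dictionary keys, and the example counts are hypothetical and are not taken from the corpus.</p>
      <preformat><![CDATA[
```python
def pos_indices(pos_counts):
    """Compute Inat, Ivat and Inom from word-use counts per part of speech."""
    v_n = pos_counts["noun"]
    v_adj = pos_counts["adjective"]
    v_adv = pos_counts["adverb"]
    v_v = pos_counts["verb"]
    return {
        "Inat": v_n / v_adj,   # index of nominal modifiers: Vn / Vadj
        "Ivat": v_adv / v_v,   # index of verbal modifiers: Vadv / Vv
        "Inom": v_n / v_v,     # degree of nominality: Vn / Vv
    }


# Hypothetical counts chosen for easy checking.
example = pos_indices({"noun": 200, "adjective": 80, "adverb": 50, "verb": 100})
print(example)  # prints: {'Inat': 2.5, 'Ivat': 0.5, 'Inom': 2.0}
```
]]></preformat>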
      <p>The indices of epithetization, nominalization, and verbal modifiers serve as an
important supplement to the qualitative analysis, although they are not the defining
characteristics of the stylistic interpretation of the text. The quantitative relations of parts
of speech in M. Yatskiv's novel "In a Clutch" demonstrate the author's frequent use of
nominal and verbal modifiers, as well as a high degree of nominality when compared to
the average figures for fiction. As a result, several coefficients have been calculated to
quantitatively characterize the lexical level of M. Yatskiv's works of fiction in various ways
(see Table 6).</p>
      <p>To determine the significance or insignificance of the statistical difference between the
coefficients of the author's long prose and short stories, χ2 has been calculated, which is
also known as the homogeneity criterion in quantitative linguistics. To calculate the
criterion of homogeneity, it is necessary to have a certain number of indicators for each
sample. This involves constructing a table with a number of rows equal to the number of
samples and a number of columns equal to the number of indicators to be compared.
Based on the results of our research, the resulting Table 7 is as follows:</p>
      <p>To determine whether χ2 indicates a significant difference, it is necessary to
refer to the table of critical values of χ2. This requires the number of degrees of
freedom, which in this particular case is f = 8. If the calculated value of χ2 is greater than
the table value for the given significance level, the difference is considered significant. In
our case, the calculated value of 0.45 is far below the smallest critical value in the table.
This indicates that the difference in statistical indicators characterizing the lexical level of
M. Yatskiv's short stories and novel is statistically insignificant. It can be
concluded that a common idiostyle is present, uniting the works under research.</p>
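      <p>The homogeneity criterion can be illustrated with a short Python sketch. It computes Pearson's χ2 statistic and the number of degrees of freedom for a table with one row per sample and one column per indicator. The function name and the example values are hypothetical (the values of Table 7 are not used here), and every column total is assumed to be positive. For two samples and nine indicators the sketch yields (2 − 1)(9 − 1) = 8 degrees of freedom, in line with f = 8 above.</p>
      <preformat><![CDATA[
```python
def chi_square_homogeneity(table):
    """Return (chi2, degrees_of_freedom) for a rows-by-columns table of values."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected value under the hypothesis that the samples are homogeneous.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Hypothetical two-sample example with three indicators.
chi2, dof = chi_square_homogeneity([[12, 18, 30], [10, 20, 30]])
print(round(chi2, 3), dof)
```
]]></preformat>
      <p>The resulting statistic is then compared with the critical value for the computed number of degrees of freedom at the chosen significance level, as described above.</p>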
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>Linguistic corpora play a vital role in linguistic research, providing a systematic and
structured collection of written and spoken texts in electronic form. These corpora allow
for the application of mathematical modeling and computer-informational approaches to
analyze language data and draw insightful conclusions. The definition and criteria for a
corpus have evolved over time, emphasizing the importance of machine readability,
representativeness, standardization, and other features. Various classifications of
linguistic corpora have been proposed based on factors such as the type of linguistic data,
parallelism, literature, purpose of creation, genre, availability, and more.</p>
      <p>The creation of a linguistic corpus requires careful selection of texts based on
diachronic, stylistic, territorial, and quantitative aspects. The corpus of M. Yatskiv's prose
serves as a valuable resource for studying the author's idiostyle, offering both qualitative
and quantitative analysis of language. The corpus is classified as illustrative, full-text,
static, author's language, monolingual, written, and annotated.</p>
      <p>Quantitative characteristics of the research corpus, such as word uses, word forms, and
parts of speech, have been analyzed for the story "In a Clutch" and 42 short stories by M.
Yatskiv. Digitization, normalization, and tagging of the texts, together with software tools such
as AntConc and MS Excel, facilitated data processing and statistical analysis. The
linguistic statistical analysis of the research corpus has provided insights into the volume
of the text, the number of word forms, vocabulary volume, and vocabulary richness.</p>
      <p>Overall, linguistic corpora and their analysis offer valuable resources and
methodologies for studying language, enabling researchers to gain a deeper
understanding of linguistic phenomena, language variation, and idiolects. These corpora
provide a solid foundation for empirical research and facilitate the development of
linguistic theories and concepts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Demska</surname>
          </string-name>
          ,
          <article-title>Tekstovyi korpus: Ideia inshoi formy</article-title>
          .
          <source>VPTs NaUKMA</source>
          , Kyiv,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <article-title>Using Text Corpora for Understanding Polysemy in Bangla</article-title>
          ,
          in:
          <source>Proceedings of the Language Engineering Conference</source>
          , IEEE, Hyderabad, India, 13 December
          <year>2002</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          . doi: 10.1109/LEC.2002.1182297.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sinclair</surname>
          </string-name>
          , Developing Linguistic Corpora: a Guide to Good Practice,
          <year>2004</year>
          . URL: https://users.ox.ac.uk/~martinw/dlc/chapter1.htm
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hardie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>McEnery</surname>
          </string-name>
          ,
          <article-title>A Glossary of Corpus Linguistics</article-title>
          . Edinburgh University Press, Edinburgh,
          <year>2006</year>
          . doi: 10.1515/9780748626908.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Demska-Kulchytska</surname>
          </string-name>
          ,
          <article-title>Deshcho pro klasyfikatsiiu tekstovykh korpusiv</article-title>
          ,
          <source>Naukovi zapysky Ternopilskoho derzhavnoho pedahohichnoho universytetu im. V. Hnatiuka</source>
          <volume>1</volume>
          (
          <year>2004</year>
          )
          <fpage>153</fpage>
          -
          <lpage>57</lpage>
          . URL: http://ekmair.ukma.edu.ua/handle/123456789/1704.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <article-title>Corpora in Translation Studies</article-title>
          ,
          <source>Target. International Journal of Translation Studies</source>
          <volume>7</volume>
          (
          <year>1995</year>
          )
          <fpage>223</fpage>
          -
          <lpage>43</lpage>
          . doi: 10.1075/target.7.2.03bak.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Barth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schnell</surname>
          </string-name>
          ,
          <source>Understanding Corpus Linguistics</source>
          , Routledge,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Demianchuk</surname>
          </string-name>
          ,
          <article-title>Riznovydy korpusu tekstiv u protsesi perekladu dokumentiv ofitsiino-dilovoho styliu</article-title>
          .
          <source>Naukovyi visnyk DDPU imeni I. Franka. Seriia «Filolohichni nauky»: Movoznavstvo</source>
          <volume>1</volume>
          (
          <year>2016</year>
          )
          <fpage>104</fpage>
          -
          <lpage>07</lpage>
          . URL: http://ddpufilolvisnyk.com.ua/uploads/arkhiv-nomerov/2016/NV_2016_5-1/27.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Shyrokov</surname>
          </string-name>
          (Ed.),
          <article-title>Korpusna linhvistyka</article-title>
          . Dovira, Kyiv,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kulchytskyi</surname>
          </string-name>
          ,
          <article-title>Tekhnolohichni aspekty ukladannia korpusiv tekstiv</article-title>
          , in:
          <string-name>
            <given-names>O.</given-names>
            <surname>Levchenko</surname>
          </string-name>
          (Ed.),
          <source>Dani tekstovykh korpusiv u linhvistychnykh doslidzhenniakh</source>
          , Vydavnytstvo Lvivskoi politekhniky, Lviv,
          <year>2015</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kulchytskyi</surname>
          </string-name>
          ,
          <article-title>Unormuvannia tekstu pid chas dokorpusnoho opratsiuvannia: Dosvid zastosuvannia</article-title>
          ,
          <source>Visnyk Natsionalnoho universytetu «Lvivska politekhnika»</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          . doi: 10.23939/sisn2020.07.051.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.-L.</given-names>
            <surname>Merten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Geierhos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tophinke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          ,
          <article-title>Annotation uncertainty in the context of grammatical change</article-title>
          ,
          <source>International Journal of Corpus Linguistics</source>
          <volume>28</volume>
          :
          <issue>3</issue>
          (
          <year>2023</year>
          )
          <fpage>430</fpage>
          -
          <lpage>459</lpage>
          . doi: 10.1075/ijcl.20113.mer
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mesch</surname>
          </string-name>
          ,
          <article-title>Creating a multifaceted corpus of Swedish Sign Language</article-title>
          , in: E. Wehrmeyer (Ed.),
          <source>Advances in Sign Language Corpus Linguistics</source>
          , John Benjamins, Amsterdam,
          <year>2023</year>
          . doi: 10.1075/scl.108.09mes.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buk</surname>
          </string-name>
          ,
          <article-title>Velyka proza Ivana Franka: elektronnyy korpus, chastotni slovnyky ta inshi mizhdystsyplinarni konteksty</article-title>
          , Lviv Ivan Franko University, Lviv,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Khomytska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Teslyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kryvinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bazylevych</surname>
          </string-name>
          ,
          <article-title>Software-Based Approach Towards Automated Authorship Acknowledgement - Chi-Square Test on One Consonant Group</article-title>
          ,
          <source>Electronics</source>
          <volume>9</volume>
          :
          <issue>7</issue>
          (
          <year>2020</year>
          ) 1138. doi: 10.3390/electronics9071138. URL: https://www.mdpi.com/2079-9292/9/7/1138
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shestakevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Volkov</surname>
          </string-name>
          ,
          <article-title>The criteria for choosing the optimal solution under the uncertainty in project management</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>2851</volume>
          (
          <year>2021</year>
          )
          <fpage>95</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gozhyj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kalinina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shiyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nechakhin</surname>
          </string-name>
          ,
          <article-title>Building a Fuel Measurement System Model based on Colored Petri Nets</article-title>
          , in:
          <source>International Scientific and Technical Conference on Computer Sciences and Information Technologies</source>
          ,
          <year>2023</year>
          . doi: 10.1109/CSIT61576.2023.10324266
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kulchytskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tsiokh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malaniuk</surname>
          </string-name>
          ,
          <article-title>Quantitative Equivalence Level in Poetry Translation</article-title>
          , in:
          <source>Proceedings of the 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>54</lpage>
          . doi: 10.1109/stc-csit.2018.8526715
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paquot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plonsky</surname>
          </string-name>
          ,
          <article-title>Quantitative research methods and study quality in learner corpus research</article-title>
          ,
          <source>International Journal of Learner Corpus Research</source>
          <volume>1</volume>
          :
          <issue>3</issue>
          (
          <year>2017</year>
          )
          <fpage>61</fpage>
          -
          <lpage>94</lpage>
          . URL: http://hdl.handle.net/2078.1/185993
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Seifart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mundry</surname>
          </string-name>
          ,
          <article-title>Quantitative Comparative Linguistics based on Tiny Corpora: Ngram Language Identification of Wordlists of Known and Unknown Languages from Amazonia and Beyond</article-title>
          ,
          <source>Journal of Quantitative Linguistics</source>
          <volume>22</volume>
          (
          <year>2015</year>
          )
          <fpage>202</fpage>
          -
          <lpage>214</lpage>
          . doi: 10.1080/09296174.2015.1037161
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karaban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karaban</surname>
          </string-name>
          ,
          <article-title>AI-translated poetry: Ivan Franko's poems in GPT-3.5-driven machine and human-produced translations</article-title>
          ,
          <source>Forum for Linguistic Studies</source>
          <volume>6</volume>
          (
          <year>2024</year>
          ) doi: 10.59400/fls.v6i1.1994
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <article-title>Lexical Richness and Text Length: An Entropy-based Perspective</article-title>
          ,
          <source>Journal of Quantitative Linguistics</source>
          ,
          <volume>29</volume>
          :1 (
          <year>2022</year>
          )
          <fpage>62</fpage>
          -
          <lpage>79</lpage>
          . doi: 10.1080/09296174.2020.1766346
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rovenchak</surname>
          </string-name>
          ,
          <article-title>Rank-Frequency Analysis for Functional Style Corpora of Ukrainian</article-title>
          ,
          <source>Journal of Quantitative Linguistics</source>
          <volume>11</volume>
          :
          <issue>3</issue>
          (
          <year>2004</year>
          )
          <fpage>161</fpage>
          -
          <lpage>171</lpage>
          . doi: 10.1080/0929617042000314912
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buk</surname>
          </string-name>
          ,
          <article-title>Statystychni kharakterystyky leksyky osnovnykh funktsionalnykh styliv ukrainskoi movy: Sproba porivniannia</article-title>
          ,
          <source>Leksykohrafichnyi biuleten</source>
          <volume>13</volume>
          (
          <year>2006</year>
          ),
          <fpage>166</fpage>
          -
          <lpage>70</lpage>
          . URL: http://dspace.nbuv.gov.ua/handle/123456789/72846.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Divjak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <article-title>Slavic Corpus and Computational Linguistics</article-title>
          ,
          <source>Journal of the Slavic Linguistics Society</source>
          <volume>25</volume>
          (
          <year>2017</year>
          )
          <fpage>171</fpage>
          -
          <lpage>199</lpage>
          . doi: 10.1353/jsl.2017.0008
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vincze</surname>
          </string-name>
          ,
          <article-title>The Relationship of Dependency Relations and Parts of Speech in Hungarian</article-title>
          ,
          <source>Journal of Quantitative Linguistics</source>
          <volume>22</volume>
          :
          <issue>1</issue>
          (
          <year>2015</year>
          )
          <fpage>44</fpage>
          -
          <lpage>54</lpage>
          . doi: 10.1080/09296174.2014.974458
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>V.</given-names>
            <surname>Perebyinis</surname>
          </string-name>
          ,
          <article-title>Statystychni metody dlia linhvistiv</article-title>
          .
          <source>Nova knyha, Vinnytsia</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>