=Paper= {{Paper |id=Vol-2917/paper28 |storemode=property |title=Corpus Technologies in Translation Studies: Fiction as Document |pdfUrl=https://ceur-ws.org/Vol-2917/paper28.pdf |volume=Vol-2917 |authors=Nataliia Hrytsiv,Tetiana Shestakevych,Julia Shyyka |dblpUrl=https://dblp.org/rec/conf/momlet/HrytsivSS21 }} ==Corpus Technologies in Translation Studies: Fiction as Document== https://ceur-ws.org/Vol-2917/paper28.pdf
Corpus Technologies in Translation Studies: Fiction as Document
Nataliia Hrytsiv, Tetiana Shestakevych and Julia Shyyka

Lviv Polytechnic National University, Stepana Bandery Street, 12, Lviv, 79000, Ukraine


                Abstract
                In the paper we focus on morphological, syntactic, semantic and anaphoric markup. Based on
                the raw material obtained from Jack London’s texts, we come to the following findings: (a)
                indices of vocabulary richness, exclusivity for the text and the vocabulary, the concentration
                of the vocabulary do not differ significantly; (b) the most frequent in the target text are
                functional words; (c) the epithetization index indicates the number of nouns per adjective in
                the text; (d) the index of verb phrases indicated the number of adverbs per verb; (e) the degree
                of nominality shows the number of nouns per verb, in the original text there are 1.23 nouns per
                verb, in the translated text - 1.36 per 1. To define the significance/insignificance of the
                statistical difference between the values of the coefficients for the source and target texts, χ2
                or criterion of homogeneity in linguostatistics has been calculated. Concluding the quantitative
                study of the collection "The Children of the Frost", it can be noted that: (a) the number of word
                usages in the source text exceeds the number of word usages in the target text both in the whole
                corpus and in separate stories.

                Keywords 1
                Translation studies, corpus linguistics, fiction, text mark-up, part-of-speech tagging, AntConc.

1 Introduction
    A central notion of corpus linguistics is the concept of "corpus of texts". On the one hand, it is the
main goal of corpus linguistics, and on the other hand, it is the object of study in this field of linguistics
[1].
    The linguistic corpus is characterized by some unique features that distinguish it from ordinary
collections of digital texts. The basic features of the text corpus are machine-readable (existence of a
digital form of text representation and a special system of encoding text data) and representativeness.
There are several similar definitions of corpus, which are based on the features of machine-readability
and representativeness as mandatory, determinative, for example: "collection of machine-readable texts
selected in such a way as to best represent the language and its diversity"; "corpora" is a large number
of natural language texts that have a computer form and are the object of a certain linguistic study,
where "natural" means everything that was expressed orally or in written form" [2, 3]. Complementing
machine readability and representativeness with the parameter of corpus applicability in linguistic
research, N. Dash and B. Chaudhuri define corpus as "a collection of linguistic data composed either
from written texts or from transcribed oral texts, the main purpose of which is to examine hypotheses
about language".
    The main problem in corpus linguistics is the creation of means of automatic (or at least automated)
text annotation according to different criteria – morphological, orthoepic, semantic, syntactic, etc.
According to V.A. Shyrokov, the main idea of the system engineering of a linguistic corpus (if to omit
technical details) is to automatically divide the electronic text of a literary source into "microcontexts"
i.e. in text fragments that are "grouped" around the word that is the object of interpretation. Thus, there


MoMLeT+DS 2021: 3rd International Workshop on Modern Machine Learning Technologies and Data Science, June 5, 2021, Lviv-Shatsk,
Ukraine
EMAIL: nhrytsiv@yahoo.com (N. Hrytsiv); Tetiana.v.shestakevych@lpnu.ua (T. Shestakevych); julia-shyika@ukr.net (Ju. Shyyka)
ORCID: 0000-0001-6660-7161 (N. Hrytsiv); 0000-0002-4898-6927 (T. Shestakevych); 0000-0003-2474-0479 (Ju. Shyyka)
             ©️ 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
is no need to form and store a traditional linguistic object – a lexical card as a separate physical object;
it becomes a virtual object, i.e. implemented as a relationship in the database. The key issue of corpus
issues are widely ranged and involve studies of the general theory of corpus linguistics.

2 Theoretical and procedural background
   In modern corpus linguistics apart from the creating of regular language corpora, the creation of
author corpora is widespread. Of special scholarly interest is a multilingual corpora are comparable and
parallel corpora or translation corpora [4].

2.1 Corpus technologies in translation studies
   In our research, it is advisable to conduct a contrastive analysis based on the comparison of certain
quantitative information of the original text and its translations at the lexical level. The defining of such
equivalence (or its absence) between the original text and the translation has been carried out in our
study.
   The main tool of corpus linguistics, which allows achieving the goal in a particular area, is the
corpus, which is viewed as a set of organized digital texts, which are used for certain linguistic purposes.
   Compiling a corpus, it is necessary to take into account some criteria: providing the
representativeness, authenticity, selectivity, balance, machine readability and markup. The markup
process (or annotation) fills the corpus with information that can then be used to investigate specific
issues [5, 6]. Annotation involves adding extralinguistic, structural, and linguistic special markers to
texts or their components. There are several types of linguistic markup: morphological, syntactic,
semantic, anaphoric and prosodic. Also, the following procedures are performed: tokenization,
lemmatization, stemming and parsing. Most of the existing corpora are of the morphological or
syntactic type. It is worth mentioning that syntactic corpora explicitly or implicitly contain
morphological characteristics of lexical units.

2.2 Structuring a case study
    Creating a corpus is quite a difficult task. According to Leach, to obtain some results, it is necessary
to carry out some preliminary work. Creating a corpus takes twice as long and sometimes ten times
more effort than its use. As has been already mentioned, the linguistic corpus contains a markup on at
least one linguistic parameter. This feature distinguishes the linguistic corpus from a large number of
other linguistic information and tool systems or databases. In other words, the corpus = text + markup.
The process of tagging [7, 8] or annotation is to attribute to the texts and their components special tags:
     external, extralinguistic (information about the author and the text: author, title, year and place
    of publication, genre, subject; information about the author may include not only his/her name, but
    also age, gender, life-years etc. (this information coding is called meta-markup);
     structural (chapter, paragraph, sentence, word form);
     linguistic, which describe lexical, grammatical and other characteristics of the text elements.
    The set of these metadata largely determines the competencies provided by the corpora to
researchers. The annotation adds value to the corpus, as it significantly expands the range of research
issues that can be investigated on the material of this corpus. When choosing these data it is necessary
to take into consideration the purposes of research and needs of linguists as well as the possibility to
add to the text some extra features.
    It is also necessary to note the difference between annotation and structural markup of the
corpus/corpus data. The distinction between these two concepts is based on the definition of annotation
offered by G. Leach, according to which "the process of annotation of corpus data is the adding of
interpreted linguistic information to the digital corpus of oral and/or written speech.
    The term "interpreted linguistic information" means an annotation that is to some extent a product
of the understanding of the text by the human mind (for example, an annotation on parts of speech).
Markup provides relatively objective verified information about the parts of the corpus and the structure
of each text.
   Linguistic annotation in corpus linguistics is traditionally interpreted as:
    any linguistic information about linguistically relevant units of text data presented in a formal
   code;
    adding formalized linguistic information into the digital text;
    presence of such information in the text.
   Linguistic annotation can be performed at different levels and take different forms. Thus, there are
the following types of markup:
    morphological markup: in foreign terminology, the term part-of-speech tagging (POS-tagging)
   is used. Morphological markups include not only the features of a part of speech but also the features
   of grammatical categories specific to that part of speech. This is the main type of markup because
   most large corpora are morphologically marked. Besides, morphological analysis is considered as
   a basis for further forms of analysis – syntactic and semantic. Moreover, advances in computer
   morphology allow marking large corpora automatically. For example, the British National Corpus
   is marked like that. The used markup is known as markup C5 (basic) and C7 (supplemented).
    syntactic markup: that results from syntactic analysis or parsing, based on morphological
   analysis data. This type of markup describes the syntactic relations between lexical units and various
   syntactic constructions (for example, a subordinate clause, a verb phrase, etc.).
    semantic markup: although there is no single semantic theory for semantics, most often
   semantic tags denote the semantic categories to which a word or phrase belongs as well as narrower
   subcategories that specify its meaning.
    anaphoric markup: it fixes referential connections, for example, of the pronoun. Prosodic
   markup. In prosodic corpora, markups describing stress and intonation are used. In the oral speech
   corpora, prosodic markup is often accompanied by so-called discourse markup, which is used to
   indicate pauses, repetitions, warnings, etc. Other types of information can also be encoded in the
   corpus. For example, sociolinguistic information about such characteristics as gender, age, social
   status, and place of residence may be presented in colloquial speech corpora.
   During the creation of the corpus several procedures and programs are used, such as tokenization,
lemmatization, stemming and parsing.
   Tokenization is a division of the string of natural language symbols into separate significant units
(tokens, word forms).
   Lemmatization is a process of forming the initial form of a word, based on its other forms.
   Stamming is the process of defining a stem of a word.
   Parsing is the process of analyzing the syntactic structure of a text or part of a text, which is based
on comparing the linear sequence of tokens (words, tokens) of language with its formal grammar. The
construction of automatic parsers for large corpora is one of the most important areas of computational
linguistics.
   Compiling a corpus and its use can be implemented according to the “model”: Annotation -
Abstraction - Analysis. The first stage involves the collection, standardization, segmentation,
processing and often manual checking of texts. At the second stage, the research topic is selected, the
parameters are determined and the corresponding fragment is removed from the corpus. At the final
stage, the hypothesis is constructed and examined, specific rules and systematized structures are
searched. During the analysis, automatic and interactive methods, as well as quantitative and qualitative
analytical methods are used. The most difficult is to identify specific issues to solve real problems of
theoretical, descriptive and applied linguistics.
   Thus, fulfilling all the requirements for the creation of a linguistic corpus, a convenient tool for
further use in the work of any complexity to solve certain problems can be obtained.

2.3 A case study constrains
   The sample of texts for the study has been formed according to the rules of sampling by
V. Perebyinis:
    texts should be chronologically limited (the chronological boundaries for the materials of our
   study are 1900-1902, it was during this period that collections of short stories "The Son of the Wolf"
   (1900), "The God of His Fathers" (1901) and “Children of the Frost” (1902) were written and
   published. Fathers (1901);
    texts should be limited in the genre (the sample for our study contains only short stories, so it
   represents the short prose of the writer);
    texts should be thematically limited (the main topics of the stories are the lives of gold diggers
   and Indians of the North as well as travelling and adventures in the southern seas);
    texts should be homogeneous in the author's style (all stories belong to one writer, and proving
   the homogeneity of the author's style is the task of our study).

3 Results and discussion
3.1 Raw material resource base
   It has been presumed that statistical population is presented by the Northern stories of J. London and
their Ukrainian translations, which were made in the early twentieth century and of which the sample
has been formed by the method of purposeful sampling which corresponds to the main condition of its
organization i.e. representativeness. The sample size is 82665-word usages.
   The following works were selected for further analysis:

Table 1
Source base of the research (original stories)
                        Story                                    Number of word usages
                The White Silence                                         3733
               The Son of the Wolf                                        6114
              The Men of Forty-Mile                                       3156
                 In a Far Country                                         6239
               To the Man on the Trail                                    3139
             The Priestly Prerogative                                     4103
             The Wisdom of the Trail                                      2988
                The Wife of a King                                        4834
             An Odyssey of the North                                      10669
           In the Forests of the North                                    5970
                  The Law of Life                                         2836
            Nam-Bok the Unveracious                                       4500
              The Master of Mystery                                       4085
                  The Sunlanders                                          6368
           The Sickness of Lone Chief                                     3632
             Keesh, the Son of Keesh                                      3135
               The Death of Ligoun                                        3610
                  Li Wan, the Fair                                        5249
           The League of the Old Men                                      6293
                       Total:                                             91253
   Thus, 19 stories of J. London and 19 translations into Ukrainian have been selected.
Figure 2: Source base of the research (translations)

3.2 Concordance and AntConc options
    The text components include units of all language levels. The most significant unit is the word, the
minimum semantic language unit, which is freely reproduced in the language and serves to build
expressions (sentences). Stylistic analysis of texts shows that texts differ in units of all language levels,
but the stylistic potential of these units is different: the smallest unit is at the phonetic level, the largest
unit is at the lexical and syntactic levels. That is why lexical units of text (words) are the object of our
research.
    For quantitative research, the AntConc program environment has been used [9, 10, 11]. V. Shyrokov
claims that the main and quite effective working tool of corpus linguistics is the concordancer, a special
program, which quickly processes the corpus of texts and can perform the following functions: search
for morphemes, lexical units, phrases in the context; determining the frequency rank of grammatical
forms. Dictionaries and concordances can be compiled with the help of specialized programs and tools
based on the corpus of texts of any genres.
    A distinction should be made between two terms - "concordance" and "concordancer". Concordance
is defined as a catalogue that contains all words used in a particular text or the works of a particular
author (authors) or a list of all usages of the selected word, presented in the context of its use.
    In linguistic research, concordance is used to perform the following tasks: comparing different word
usages of one lexical unit, measuring and comparing the frequency of words and phrases, searching and
analyzing idioms and paremiological units, creating word lists by selected criteria, analyzing
terminological vocabulary.
    Concordancer is defined as instrumental software that searches in corpus the usage of the required
word, phrase or another specified element with a representation of the contextual environment of this
unit".
    The advantage of the concordancer over other software for linguistic research is the ability to detect
the compatibility of words and their environment. The list of concordancer options is larger than
concordance: it can search for words by lemmas and morphological features, search for specific
language units and present results in concordance views, display text-type information, output lexical
and grammatical statistics, output search results from a predefined number of words (length of word
context), save the results in a separate file.
    Concordancers can be divided into are index (require pre-indexing of text units), non-index, or
network (require Internet connection and are based on its technology, mostly paid) and stationary (can
be downloaded to a personal computer, do not require an Internet connection, mostly free).
    To conduct our study, we used the AntConc program i.e. a stationary, non-index concordancer. The
technical advantages of this program include free access to the Internet, free upgrades and compatibility
with the main operating systems - Windows, Mac OS X and Linux. The AntConc program is very useful
for linguists who are engaged in linguistic and statistical research because it can perform the following
operations:
     to find lexical units and stable expressions of a certain subject area, given by the user;
     to make samples of lexical units which belong to one thematic group;
     to identify usage contexts of given words.
    Processing the entered data, AntConc can sort the selected language units according to the criteria
set by the user, namely:
     frequency;
     word ending;
     alphabetical order.
    The program also helps to choose the number of words to the right and left of the search word
displayed in the program window, get the attributes of a given word in alphabetical order, search for
collocations by constructing n-grams of different lengths, compare keywords in different text boxes.
    So AntConc is a free, multi-platform tool for corpus linguistics research and data learning. This
program is developed in Perl using the PerlApp ActiveState compiler to create executable files for
different operating systems. It does not require special installation, just double-click on the program
icon to run it. AntConc contains seven tools that can be run by clicking on their "bookmarks" in the
program window, or by using the function keys F1 - F7.
    This tool enables the identification of corpus-specific words. They have the following
characteristics:
     Concordance. It presents search results in KWIC format (keyword-in-context) and shows how
    words and phrases are usually used in the corpus of texts.
     Concordance Plot. This tool shows search results on a barcode chart and shows where the search
    results appear in the ascending texts.
     File Viewer. This tool shows the text of separate files and allows interrogating the results
    obtained by other tools of AntConc.
     Clusters (N-Grams). This tool shows clusters based on search terms. It summarizes the results
    obtained by such tools as Concordance or Concordance Diagram. The N-gram tool scans the entire
    corpus for the length of the clusters in the "N" number of words (for example, one word, two words,
    ...). This enables us to find common expressions in the corpus.
     Collocates. This tool shows search word collocations and enables to explore inconsistent
    patterns in language.
     Words List. This tool counts all the words in the corpus and presents them in an ordered list
    and allows finding which words are most used in the corpus.
     Keyword List. This tool shows which words are unusually common (or rare) in the corpus.
   The screenshots and figures to follow present the program interface in details and the list of the
abovementioned tools. Thus, the quantitative distribution of words in originals texts and their
translations has been made with the help of the program AntConc.




Figure 2: AntConc interface




Figure 3: Quantitative word distribution in the original




Figure 4: Quantitative word distribution in the translation

    At the next stage of the study, the results of preliminary processing of the texts have been transferred
to the MS Excel environment and the part of speech of each word as well as its lemma and the number
of uses separately for Ukrainian and English texts has been identified
Figure 5: Analysis of parts of speech and quantitative distribution of words in the original texts




Figure 6: Analysis of parts of speech and quantitative distribution of words in translation

   In the current study the traditional classification of parts of speech has been used:
    for Ukrainian: content words such as nouns, adjectives, pronouns, verbs (particle,
   transgressive), adverbs, numerals; function words such as prepositions, conjunctions, particles,
   interjections;
    for English: content words such as nouns, adjectives, pronouns, verbs, adverbs, numerals;
   function words such as prepositions, conjunctions, articles, interjections.
   The following principles of combining word forms have been used during lemmatization:
    for Ukrainian: noun forms have been reduced to the nominative singular; verbs - to the
   infinitive; all adjective forms, including degrees of comparison, have been reduced to the nominative
   singular of the masculine gender; the comparative and superlative adverbs have been reduced to the
   original forms of adverbs; case forms have been reduced according to the type of declension of
   pronouns and numerals; phonetic variants of words have been reduced to the original form (the most
   frequent), where the alternation of first or final letters is caused by euphonism;
    for English: possessive forms and plural forms of the noun have been reduced to the original
   form; all tense forms of verbs, gerunds, Participle I and Participle II have been reduced to the
   infinitive; the degrees of comparison of adjectives have been reduced to the original form.
   With the help of a specially written computer program, the absolute frequency of each lemma has
been automatically calculated.

3.3.    Quantitative characteristics of the original: a case study of ‘Sun of the
        Wolf’ collection
   In our study, priority has been given to the vocabulary of the original text and its translation, and
with the help of the automatic processing of the corpus and statistical calculations, several important
characteristics have been defined which can form the basis for clarifying the writer’s individual style
and help to conclude the aesthetic significance of the original texts and its translations.
    From the point of view of the quantitative-linguistic analysis of texts [12-15], several problems have
been defined which refer to the stylistic aspect of research of vocabulary of the given text, in particular
a volume of the text i.e. total number of words in the text (N), volume of vocabulary, number of word
forms (Vf), the volume of the vocabulary of lexemes, the number of lemmatized words in the text (V),
Hapax legomena (V1) i.e. words that occur in the sample once and their frequency equals 1, the number
of words with a frequency ≥ 10 (V10), number of letters (C), number of sentences (S), number of
content words (Nps), number of functional words (Npsa) [16, 17]. Based on these statistic data, it is
possible to calculate:
     vocabulary richness, diversity index (Id) - the ratio of the volume of the vocabulary of lexemes
    (V) to the volume of the text (N) is calculated by the formula: Id = V / N;
     the average repeatability of the word in the text (Iwr) - the ratio of the volume of the text (N)
    to the volume of the vocabulary of lexemes (V) - the value inverse to the diversity index is calculated
    by the formula: Iwr = N/V;
     exclusivity index is calculated separately for the vocabulary and the text, it characterizes the
    variability of vocabulary, i.e. the part of the text (vocabulary) which consists of the words that
    occurred once in the text: the index of exclusivity for the vocabulary (Iev) - the ratio of lexemes with
    frequency 1 (V1) to the total number of lexemes: Iev =V1/V; text exclusivity index (Ien) - the ratio
    of the number of lexemes with frequency 1 (V1) to the text volume (N): Ien = V1/N;
     dictionary concentration index (Ivc) - the ratio of the number of words in the vocabulary with
    an absolute frequency of 10 or more (V10) to the total number of words in the vocabulary: Ivc =
    V10/V;
     lexical density index (Id) - the ratio of content words (Nps) in the text to the total number of
    words: Id = Nps / V;
     automatic readability index (ARI) - the degree of readability of texts, the ratio of characters in
    the word and the number of sentences; it is calculated by the formula: ARI = 4.71 * C / V + 0.5 * V
    / S - 21.43;
     noun phrases index (Inat), i.e. the epithetization index (the ratio between the number of nouns
    (Vn) and the number of adjectives (Vadj): Inat = Vn/ Vadj;
     verb phrases index (Ivat) - the ratio of the number of adverbs (Vadv) to the number of verbs
    (Vv): Ivat = Vadv/Vv;
     degree of nominality (Inom) - the ratio of the number of nouns (Vn) to the number of verbs
    (Vv): Inom = Vn/Vv.
     verb index (aggressiveness index) (Іv) - the ratio of the number of verbs and verb forms
    (particles and adverbs) (Vv) to the total number of the words: Іv = Vv / N;
     logical coherence index (Ilk) - the ratio of the total number of functional words (conjunctions
    and prepositions) (Vpc) to the total number of sentences (S): Ilk = Vpc / S;
     embolism index (Iem) (clogging) of speech – the ratio of the total number of emboli
    (exclamations and particles) (Vem) to the total number of words: Iem = Vem / N.
    According to this scheme, the subcorpus of the original texts of the collection "The Son of the Wolf"
has been analyzed. General quantitative characteristics of the stories are given in Table 2.
    Morphological markup of the corpus of short stories has been performed, the classical division of
words into parts of speech has been applied, and for each part of speech, its frequency in the text and
the author's vocabulary (register) has been automatically obtained:
    The most frequent words in the original text are functional words: 5.2% of the vocabulary. They
function most actively in the text and cover almost a third of it: 30.7%. Pronouns are also frequently
used in the text: 3% of the vocabulary and about 13% of the text. Adverbs have approximately the same
percentage in the text and the vocabulary (9% and 10%, respectively) and numerals: about 1%. Nouns
(22% and 39%), verbs (19% and 26%) and adjectives (8% and 16%) in the text and the writer's
vocabulary, respectively, confirm their stylistic function and their ratio proves the nominality of
J. London’s individual style.
   Based on the established general quantitative characteristics of the text and the base of partial
linguistic distribution indicators, the indices that characterize the lexical level of the corpus have been
calculated. The results of the calculations are presented in Table 3.

Table 2
General quantitative characteristics of the original text




                                                                                                                                                                                                          7. Number of lemmas used 10 or more




                                                                                                                                                                                                                                                                                      9. The number of sentences in the text
                                                                                                                                      5.Number of word forms used 10 or
                                                                                                   4. Hapax legomena for word forms




                                                                                                                                                                                                                                                8. The number of characters of the
                                                                                                                                                                           6. Hapax legomena for lemmas




                                                                                                                                                                                                                                                extended alphabet in the text
                                                  2. Number of word forms
                       1. Number of word usage




                                                                             3. Number of words




                                                                                                                                      more times




                                                                                                                                                                                                          times
      Story




The         White     3733                       1326                       1072                  891                                        52                           682                                  55                               15835                                235
Silence
The Son of the        6114                       1785                       1471                  1166                                       85                           904                                  87                               26236                                380
Wolf
The Men of            3156                       1185                       949                   830                                        48                           615                                  47                               13468                                219
Forty-Mile
In a Far Country      6239                       2032                       1779                  1390                                       80                           1168                                 76                               28009                                391
To the Man on         3139                       1223                       986                   844                                        39                            636                                 39                               13582                                190
the Trail
The        Priestly   4703                       1487                       1223                  983                                        80                           783                                  74                               19622                                367
Prerogative
The Wisdom of         2988                       1027                       875                   689                                        48                           567                                  43                               12826                                166
the Trail
The Wife of a         4834                       2404                       1566                  1826                                       49                           943                                  74                               24556                                323
King
An Odyssey of         10669                      1818                       1755                  1037                                    152                             1037                             152                                  42601                                683
the North
    Total:            45575                      14287                      11676                 9656                                    633                             7335                             647                                  196735                               2954

   Although the indices of epithetization, nominality and verb phrases are not the main characteristics
of the stylistic interpretation of the text, they can be considered a significant complement to the
qualitative analysis of the text, especially in comparing the original text and the translated text.
Nominative and verb (aggressiveness) indices confirm the nominal individual style of J. London's short
prose, which is characterized by a high level of epithets, verb phrases, low "clutter" of speech and high
logical coherence.
Table 3
Quantitative indicators of the lexical level of the original text




                                                                                                         Index of exclusivity for the text
                                       The richness of the vocabulary

                                                                        The average repetition of the




                                                                                                                                                                                                                                  Automatic readability index
                                                                                                                                                                             Index of text concentration
                                                                                                                                             Index of exclusivity for the




                                                                                                                                                                                                                                                                                                                                                                                                              Speech embolism index
                                                                                                                                                                                                                                                                                                                                                                                   logical coherence index
                                                                                                                                                                                                                                                                                           Index of nouns phrases


                                                                                                                                                                                                                                                                                                                     Index of verb phrases
                                                                                                                                                                                                                                                                 Lexical density index
                                                                                                                                                                                                           Index of vocabulary




                                                                                                                                                                                                                                                                                                                                              Index of nominality
                                                                        word in the text




                                                                                                                                                                                                           concentration
                                                                                                                                             vocabulary




                                                                                                                                                                                                                                                                                                                                                                     Verb index
                Story




 The White Silence                   0,29                               3,48                            0,24                                 0,64                           0,01                           0,05                  6,49                           0,19                     2,79                       0,3                      1,2                    0,19          3,02                       0,02
 The Son of the Wolf                 0,24                               4,16                            0,19                                 0,61                           0,01                           0,06                  6,83                           0,19                     3,4                        0,35                     1,17                   0,19          3                          0
 The Men of Forty-Mile               0,3                                3,33                            0,26                                 0,65                           0,02                           0,05                  5,88                           0,2                      4,66                       0,25                     1,19                   0,21          2,83                       0
 In a Far Country                    0,29                               3,51                            0,22                                 0,66                           0,01                           0,04                  7,69                           0,21                     2,39                       0,26                     1,06                   0,2           3,33                       0,01
 To the Man on the Trail             0,31                               3,18                            0,27                                 0,65                           0,01                           0,04                  7,21                           0,19                     2,75                       0,31                     1,31                   0,18          3,19                       0
 The Priestly Prerogative            0,26                               3,85                            0,21                                 0,64                           0,02                           0,06                  4,63                           0,18                     3,96                       0,28                     1,05                   0,22          2,31                       0,01
  The Wisdom of the Trail            0,29                               3,41                            0,23                                 0,65                           0,02                           0,05                  7,79                           0,2                      3,05                       0,38                     1,09                   0,2           3,55                       0
 The Wife of a King                  0,32                               3,14                            0,37                                 0,6                            0,01                           0,05                  9,72                           0,21                     3,18                       0,4                      1,51                   0,19          3,13                       0,03
 An Odyssey of the North             0,17                               5,87                            0,1                                  0,57                           0,01                           0,08                  5,19                           0,23                     3,2                        0,52                     1,43                   0,15          3,57                       0
 Average value in the corpus         0,27                               3,81                            0,23                                 0,63                           0,01                           0,05                  6,87                           0,20                     3,32                       0,34                     1,23                   0,19          3,11                       0,01
3.4.    Quantitative characteristics of the translations of ‘Sun of the Wolf’
        collection
   On the next stage of the research the similar calculations for the translated stories of Jack London
have been made

Table 4
General quantitative characteristics of translated texts




                                                                                                                                                                                         of the extended alphabet in
                                                                                                                                                              Number of lemmas used 10
                                                                                       Hapax legomena for word




                                                                                                                                                                                         The number of characters



                                                                                                                                                                                                                       The number of sentences
                                                                                                                 Number of word forms
                                              Number of word forms
                       Number of word usage




                                                                                                                 used 10 or more times

                                                                                                                                         Hapax legomena for
                                                                     Number of words




                                                                                                                                                              or more times




                                                                                                                                                                                                                       in the text
                                                                                                                                         lemmas




                                                                                                                                                                                         the text
                                                                                       forms
        Story




 The        White 3083 1675 1221 1324                                                                                 32                 845                     49                        1551                          266
 Silence
 The Son of the 4911 2404 1566 1826                                                                                   49                 943                     74                      24556                           385
 Wolf
   The Men of     2683 1399 1014 1060                                                                                 29                 637                     40                      12986                           240
    Forty-Mile
      In a Far    4998 2612 1978 2078                                                                                 51                 1322                    61                      25575                           397
     Country
 To the Man on 2973 1601 1208 1247                                                                                    27                 832                     34                      14902                           200
     the Trail
   The Priestly   3983 1994 1321 1575                                                                                 40                 823                     58                      18953                           235
   Prerogative
  The Wisdom      2440 1307    964  1017                                                                              28                 653                     37                      12170                           179
   of the Trail
  The Wife of a 4751 2339 1588 1790                                                                                   53                 297                     64                      24329                           331
        King
 An Odyssey of 7191 2319 2319 1630                                                                                    81                 1630                    81                      31980                           747
    the North
       Total:     37013 17650 13179 13547                                                                          390                   8657                 498                        167002                        2980

    Based on the morphological corpus markup of translated stories the frequency of each part of speech
in the text and in the author’s register has been automatically calculated.
    The most frequent words in the target text are the functional parts of speech which have 5.74% of
the vocabulary and 25.98% of the texts. High frequency is also shown by pronouns (3.07% of the
vocabulary and 12, 34% of the text); adverbs (9.06% of the vocabulary and 11.19% of the text) and
numerals (1.3% of the vocabulary).
    Nouns (25.23% and 33.13%), verbs (18.64% and 31.10%) and adjectives (7.42% and 14.45%) in
the text and in the writer's vocabulary prove their stylistic function and their ratio shows the nominality
of J. London’s individual style.
    Based on the defined general quantitative characteristics of the text and indicators of distribution by
part of speech, the indices that characterize the lexical level of the corpus have been calculated
(Table 6).
    After a series of calculations, it has been found that the richness of the vocabulary i.e. the ratio of
the volume of the vocabulary of tokens on average in the corpus is 0.37. Therefore, the richness of the
vocabulary of translated stories can be considered high.
    The average repetition of a word in the text is 2.77, i.e. each word in the text is used about three
times.
    The index of exclusivity is calculated for the vocabulary and the text is 0.65 and 0.39, respectively,
which indicates a high artistic level of translated texts.
    The opposite of the index of exclusivity is the index of concentration of vocabulary and text, which
is 0.04 and 0.01, respectively.
    The index of the lexical density of translated stories by J. London is 0.2, so the functional words
have about 20%, which indicates a sufficient density of the text.
    The automatic readability index is also important for our study, because the higher the ARI, the more
difficult to understand the text is. ARI of translated stories by J. London is 8.04, which means that these
works correspond to the eighth level of complexity, they are easy to understand, but not primitive.
    Indices of epithetization, nominality and verb phrases, which are shown in Fig. 10 indicate a high
level of epithet-phrases, verb phrases, low "clutter" of speech and high logical coherence.

3.5.    Comparative quantitative characteristics of the source and target texts
   The analysis of J. London's collection of short stories "The Son of the Wolf" has been conducted on
the basis of the digital marked corpus of source and target texts; it covers a number of general
characteristics (ST stands for source text; TT is for target text).

Table 6
General quantitative characteristics of the source and target texts (word usage, word forms, lemmas)
                               Number of word usage        Number of word forms        Number of words
            Story
                                  ST            TT           ST             TT           ST         TT
 The White Silence             3733          3083         1326           1675          1072       1221
 The Son of the Wolf           6114          4911         1785           2404          1471       1566
 The Men of Forty-Mile         3156          2683         1185           1399          949        1014
 In a Far Country              6239          4998         2032           2612          1779       1978
 To the Man on the Trail       3139          2973         1223           1601          986        1208
 The Priestly Prerogative      4703          3983         1487           1994          1223       1321
  The Wisdom of the Trail      2988          2440         1027           1307          875        964
 The Wife of a King            4911          4751         2404           2339          1566       1588
 An Odyssey of the North       10669         7191         1818           2319          1818       2319
 Totally                       45652         37013        14287          17650         11739      12179

   The frequency of each part of speech in the text and the vocabulary of the author (translators) has
been compared because the ratio of parts of speech is an important statistical parameter of the individual
style of both the author and a particular work.
Table 7
Quantitative indicators of the lexical level of translation




                                                               The average repetition
                                                               of the word in the text

                                                                                         Index of exclusivity for


                                                                                                                    Index of exclusivity for




                                                                                                                                                                                     Automatic readability




                                                                                                                                                                                                                                                      Index of verb phrases
                                                                                                                                                                                                             Lexical density index
                                                                                                                                                               Index of vocabulary




                                                                                                                                                                                                                                                                              Index of nominality
                                         The richness of the




                                                                                                                                                                                                                                                                                                                                     Speech embolism
                                                                                                                                                                                                                                                                                                                 logical coherence
                                                                                                                    the vocabulary




                                                                                                                                                                                                                                     Index of nouns
                                                                                                                                               concentration


                                                                                                                                                               concentration
                                                                                                                                               Index of text
                                         vocabulary




                                                                                                                                                                                                                                                                                                    Verb index
                                                                                         the text




                                                                                                                                                                                                                                     phrases
                                                                                                                                                                                     index




                                                                                                                                                                                                                                                                                                                 index


                                                                                                                                                                                                                                                                                                                                     index
                 Story




 The White Silence
                                                        0,4               2,52             0,43                            0,69                 0,01            0,04                  7,36                        0,2                 3,98            0,49                    1,39                  0,19          2,37               0,04
 The Son of the Wolf
                                                    0,32                  3,14             0,37                                  0,6            0,01            0,05                       8,5               0,21                     3,18                 0,4                1,51                  0,19          2,63               0,03
 The Men of Forty-Mile
                                                    0,38                  2,65                  0,4                        0,63                 0,01            0,04                  6,96                   0,19                     3,54            0,47                    1,26                    0,2         2,17               0,05
 In a Far Country
                                                        0,4               2,53             0,42                            0,67                 0,01            0,03                  8,97                   0,21                             3       0,49                    1,39                  0,19          2,59               0,05
 To the Man on the Trail
                                                    0,41                  2,46             0,42                            0,69                 0,01            0,03                  9,61                   0,21                     3,67            0,46                    1,46                  0,18          3,06               0,04
 The Priestly Prerogative
                                                    0,33                  3,02                  0,4                        0,62                 0,01            0,04                  7,88                   0,19                        3,8          0,49                    1,28                  0,19          2,89               0,05
 The Wisdom of the Trail
                                                        0,4               2,53             0,42                            0,68                 0,01            0,04                  8,88                        0,2                    3,2               0,5                1,28                  0,19          2,79               0,04
 The Wife of a King
                                                    0,33                  2,99             0,38                            0,61                 0,01            0,04                  9,87                   0,22                        2,8          0,41                    1,48                  0,18              3,1            0,04
 An Odyssey of the North
                                                    0,32                       3,1         0,23                                  0,7            0,01            0,03                  4,33                   0,28                     4,15                 0,6                1,18                  0,18          2,69               0,04
 Average value in the corpus
                                                    0,37                  2,77             0,39                            0,65                 0,01            0,04                  8,04                   0,21                     3,48            0,48                    1,36                  0,19          2,70               0,04
   The most frequent words in the source and target texts are the functional words (6% of the source
text vocabulary and 7.47% of the target text vocabulary). Also, functional words are the most frequently
used in texts (29.91% in the source text and 24.11% in target text). Share of nouns and the verbs are
23% and 25% in the source texts and 18% and 19% in the target text, respectively. Pronouns are also
frequently used in the texts (3.11% in the source text vocabulary and 3.04% in the target text
vocabulary). The share of pronouns in the texts is about 13%. Approximately the same share in the text
and the vocabulary goes to adverbs (7.22% and 7.17% in the source text and 10.13% and 10.08% in the
target text) and numerals (1.12% and 1.07 in the source text and 1.36% and 1.06% in the target text).

Figure 8
Part of the speech frequency of the vocabulary and the source and target texts
      Part of speech                  Word usage                               Words
                             Source text        Target text       Source text        Target text
          Noun                   10415                9340               4899                4982
          Verb                   8397                 6901               2940                4502
        Adjective                3254                 2746               1922                2162
         Adverb                  3098                 3024               930                 1604
         Pronoun                  5505                4566                269                 523
        Numeral                   501                 491                 501                185
    Functional words             14336               9616                 468                819
         Totally                 45506               36684               11929              14777

    Nouns, verbs and adjectives are the most frequently used. Their relative number in the vocabulary,
on the contrary, exceeds the relative number in both source and target texts.
    These parts of speech present the richness of the vocabulary of the source and target texts, and also
their ratio confirms that the nominal character of the individual style of the original text was preserved
in the translated text.
    Quantitative relations between parts of speech have been compared, as they are considered key
elements of the statistical characteristics of the text are presented below.

Table 9
Quantitative relations between parts of speech
                 Index                Mean value on the corpus of the           Mean value on the
                                               source text                    corpus of the target text
         Lexical density index                    0.20                                  0.21
     Index of nominal phrases                     3.32                                  3.48
        Index of verb phrases                     0.34                                  0.48
           Nominality index                       1.23                                  1.36
              Verb index                          0.19                                  0.19
      Logical coherence index                     3.11                                  2.70
           Embolism index                         0.01                                  0.04



4    Conclusions
   According to these criteria, a collection of short stories by J. London "The Son of the Wolf" and its
translation into Ukrainian has been selected for further analysis.
   To conduct quantitative research, the AntConc program environment i.e. a stationary, index-free
concordance has been used. The technical advantages of this program include free access on the
Internet, free upgrades and compatibility with three operating systems - Windows, Mac OS X and
Linux. Processing the entered data AntConc sorts the selected language units according to the criteria
set by the user, namely:
    • frequency;
    • ending of the word;
    • alphabetical order.
    The program also allows to select the number of words to the right and left of the search word
displayed in the program window, get the attributes of a given word in alphabetical order, search for
collocations by constructing n-grams of different lengths, and compare keywords in different text
corpora.
    At the next stage of the study, the results of the preliminary processing of the texts of the stories
have been transferred to the MS Excel environment, where each part of speech its lemma and the
number of uses in source and target texts have been identified.
    In the study, the priority has been given to the lexical content of source and target texts, and with
the help of the automatic processing of the corpus and statistical calculations, several important
characteristics that are basic for clarifying the idiosyncrasy of the writer and which help to conclude the
aesthetic significance and their equivalence have been identified.
    The paper presents a statistical analysis of selected works by Jack London and their translation into
Ukrainian, which have been compared; some tables and diagrams, which identify features of functional
language styles and features of the author's style, have been made.
    Summing up the quantitative study of the collection "The Son of the Wolf", it should be noted:
    • indices of vocabulary richness, exclusivity for the text and the vocabulary, the concentration of the
    vocabulary do not differ significantly;
    • the most frequent in the target text are functional words (24.91% in the source and 24.11% in the
    target text). The share of nouns and verbs is approximately 23% and 25% in the source and 18% and
    19% in the target texts, respectively. The share of pronouns is comparatively big – about 13% of the
    text. Approximately the same share in the text and the vocabulary is covered by adverbs (7.22% and
    7.17% in source and 10.13% and 10.08% in target text) and numerals (1.12% and 1.07 in source and
    1.36% and 1.06% in target text).
    • the epithetization index indicates the number of nouns per adjective in the text, i.e. the higher the
    index, the fewer adjectives per noun. It may be concluded that the index both in source and target
    text does not differ significantly – 3.32 / 3.48, and therefore the translator managed to preserve the
    saturation of the text with figurative phrases.
    • the index of verb phrases indicated the number of adverbs per verb. The target text has a slightly
    bigger ratio – 0.48 adverbs per verb, while in the source text the ratio is 0.34 per 1.
    • the degree of nominality shows the number of nouns per verb, in the original text there are 1.23
    nouns per verb, in the translated text - 1.36 per 1. Therefore, the degree of aggression is equal in
    source and target text. This confirms the fact that the nominal character of the original text is
    accurately reproduced in translation.

5    Acknowledgements
   The project has been carried out within the complex academic topic “Application of modern
technologies for optimization of information processes in natural language” at Lviv Polytechnic
National University. At the initial stage the project underwent the consultancy of Ihor Kulchytskyy, to
whom we express our gratitude.

6    References
[1] T.J.M. Sanders, V. Demberg, J. Hoek, M.C.J. Scholman, F.T. Asr, S. Zufferey, J. Evers-Vermeul,
    Unifying dimensions in coherence relations: How various annotation frameworks are related,
    Corpus Linguistics and Linguistic Theory, 2021, 17 (1), pp. 1-71.
[2] Mosavi Miangah, Tayebeh. Different Aspects of Exploiting Corpora in Language Learning.
    Journal of Language Teaching Research, 2012, 3, 1051-1060.
[3] M.N. Wróblewska, Research impact evaluation and academic discourse. Humanities and Social
     Sciences Communications, 2021, 8 (1), art. no. 58.
[4] E. Aarden, L. Marelli, A. Blasimme, The translational lag narrative in policy discourse in the
     United States and the European Union: a comparative study. Humanities and Social Sciences
     Communications, 2021, 8 (1), art. no. 107.
[5] F. Cifuentes-Silva, J.E. Labra Gayo, Legislative document content extraction based on semantic
     web technologies: A use case about processing the history of the law. Lecture Notes in Computer
     Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
     Bioinformatics), 11503 LNCS, 2019, pp. 558-573.
[6] C. Fang, S. Zhang, Geographic information retrieval method for geography mark-Up language
     data. ISPRS International Journal of Geo-Information, 2018, 7 (3), art. no. 89.
[7] L. Xiao, D. Wei, Linguistic Question-Answering Reasoning Based on Intelligent Perception of
     Attribute Weight. Journal of Physics: Conference Series, 2021, 1883 (1), art. no. 012137.
[8] C. Chantrapornchai, A. Tunsakul, Information extraction on tourism domain using SpaCy and
     BERT. ECTI Transactions on Computer and Information Technology, 2021, 15 (1), pp. 108-122.
[9] J.S.N. Rodrigues, K. Ferreguetti, A.S. Pagano, A proposal of coextensiveness between technical
     term, nominal group, and lexical item in Brazilian Portuguese: A study based on corpus linguistics’
     software within the framework of systemic-functional theory [Uma proposta de coextensividade
     entre termo tècnico, grupo nominal e item lexical no Português Brasileiro: Um estudo com base
     em ferramentas da linguística de corpus sob o arcabouço de teoria sistêmico-funcional] Revista de
     Estudos da Linguagem, 29 (2), pp. 1325-1379, 2021.
[10] S. Rezaei, D. Kuhi, M. Saeidi, Diachronic corpus analysis of stance markers in research articles:
     The field of applied linguistics. Cogent Arts and Humanities, 2021, 8 (1), art. no. 1872165.
[11] K. Uzule, Teacher training and education programs in latvia: Are e-competences included?
     Business Management and Education, 2020, 18 (2), pp. 294-306.
[12] M. Dilai, O. Levchenko, Discourses, Surrounding Feminism in Ukraine: A Sentiment Analysis of
     Twitter Data. 2018 IEEE 13thInternational Scientific and Technical Conference on Computer
     Sciences andInformation Technologies, CSIT 2018 - Proceedings, 2, art. no. 8526694, 2018,
     pp.47–50.
[13] G. Szymanski, P. Lipinski, Model of the effectiveness of Google Adwords advertising activities.
     2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and
     Information Technologies, CSIT 2018 - Proceedings, 2, art. no. 8526633, 2019, pp. 98–101.
[14] I. Tsmots, V. Teslyuk, A. Batyuk, V. Khavalko, A. Mladenow, Information-analytical support to
     medical industry. CEUR Workshop Proceedings, 2019, 2488, pp. 246-257.
[15] V. Vasyliuk, Y. Shyika, T. Shestakevych, Modelling of the Automated Workplace of the
     Psycholinguist. 2020 IEEE 15th International Scientific and Technical Conference on Computer
     Sciences and Information Technologies, CSIT 2020 - Proceedings, 1, art. no. 9321956, pp. 276-
     279.
[16] B. Tahir, M.A. Mehmood, Corpulyzer: A Novel Framework for Building Low Resource Language
     Corpora. IEEE Access, 9, 2021, art. no. 9316706, pp. 8546-8563.
[17] P. Stiles, Beowulf 33a and Hapax Legomena. Neophilologus, 2020, 104 (2), pp. 255-261.