<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linguistic Analysis Method of Ukrainian Commercial Textual Content for Data Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>obisikalo@gmail.com</email>
          <aff>
            <institution>Vinnytsia National Technical University</institution>
            ,
            <addr-line>Vinnytsia</addr-line>
            ,
            <country country="UA">Ukraine</country>
          </aff>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1943</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This article deals with the scientific and practical task of automatically detecting significant keywords and rubricating Ukrainian content in Internet systems based on a method of linguistic analysis of text information. The article presents a theoretical and experimental substantiation of the method of linguistic analysis of Ukrainian content using Porter's stemming. The method aims to automatically detect significant keywords of Ukrainian content on the basis of the proposed formalization of the components of analysis: grammatical (graphemic), morphological, syntactic, semantic, referential, and structural.</p>
      </abstract>
      <kwd-group>
        <kwd>Text</kwd>
        <kwd>Ukrainian</kwd>
        <kwd>Algorithm</kwd>
        <kwd>Content Monitoring</kwd>
        <kwd>Keywords</kwd>
        <kwd>Content analysis</kwd>
        <kwd>Porter's Stemmer</kwd>
        <kwd>Linguistic Analysis</kwd>
        <kwd>Syntactic Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In practical terms, the analysis of the sign level of the organization of natural-language
text is limited to separating syntactic punctuation from the word itself, identifying
abbreviations, acronyms, etc. Analysis of existing texts shows that, at the sign level of
text organization, the descriptive capabilities of the semiotic system are used to encode
knowledge about fragments of reality. For instance, the use of quotation marks (for example,
the movie theater “Star”) indicates that the token in quotation marks cannot be interpreted
with the meaning given in the dictionary. Proper names given in the text may coincide in
spelling with common words but have different meanings (for example, the group Black
September, Black Friday, student Sophia Vovk (wolf), assistant Andriy Krolik (rabbit),
teacher Nadiya Kohut (rooster), singer Katya Chile, singer Alona Vinnytska, actor David
Duchovny as Fox William Mulder, Liberty Avenue, May 1 Street, actress Sarah Gabriel,
actress Nastya Zadorozhna, etc.). In addition, a number of tokens in the text are not subject
to the grammatical rules of the language but act as semantic units of the sign level (for
example, the number 30, the percentage value 15%, abbreviations such as mln, thousand, or
kg, etc.). These features of natural-language text make it necessary to treat the sign level
of text organization as the initial stage of building a model for understanding text. The
linguistic method of processing textual information to automatically detect meaningful
keywords consists of six steps.
1. Grammatical (graphemic) analysis of textual content, that is, parsing the text with
regard to the grapheme features of different languages.
2. Morphological analysis of textual content.
3. Syntactic analysis of textual content.
4. Semantic analysis of textual content.
5. Referential analysis for the formation of interphrase unities.
6. Structural analysis of textual content.</p>
      <p>The input of the graphemic analysis (parsing) step is the current
text file and the a priori reference models (lines and characters). Separating such
units of text as names, designations, titles, etc. allows us to identify at this stage some
functional elements of the structure of concepts. Therefore, it is advisable to begin
with character-level analysis of the text to solve the urgent problem of forming
effective domain-specific knowledge recognition procedures. The electronic component here
consists of electronic dictionaries of abbreviations, geographical names, and proper names.
This approach is motivated by the diversity of sign (grapheme) representations of lexical
units in the text, which defines their different semantic functions in one context or another.
For the automated processing of natural-language information, it is also essential to define
the structure of the text: to separate service information, highlight paragraphs,
headings, and more. The text is considered as an organized sequence of lines and
graphemes.</p>
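      <p>As an illustration of the sign-level separation described above, the following sketch (an assumed implementation, not the authors' code; the token class names are hypothetical) splits syntactic punctuation from words and marks numbers and percentage values as non-linguistic units:</p>
      <preformat>
```python
import re

# Sketch of grapheme-level parsing: separate punctuation from words and
# classify tokens such as numbers, percentages, and quotation marks.
TOKEN_RE = re.compile(r"\d+%|\d+|[«»“”\"]|[\w'’-]+|[.,:;!?]")

def graphemic_tokens(text):
    """Return (token, class) pairs at the sign level of text organization."""
    pairs = []
    for tok in TOKEN_RE.findall(text):
        if tok.endswith("%") or tok[0].isdigit():
            cls = "non-linguistic"   # semantic unit of the sign level
        elif tok in "«»“”\".,:;!?":
            cls = "syntactic"        # punctuation and quotation marks
        else:
            cls = "linguistic"       # ordinary word token
        pairs.append((tok, cls))
    return pairs
```
      </preformat>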
      <p>Relation of the Highlighted Issue to Important Scientific and
Practical Work
The article deals with the scientific and practical task of automatically detecting
significant keywords and rubricating Ukrainian-language content in Internet systems
based on the method of linguistic analysis of textual information. The work was
performed within the framework of joint scientific research of the Department of
Information Systems and Networks of the Lviv Polytechnic National University on the
topic «Research, development and implementation of intelligent distributed
information technologies and systems based on database resources, data warehouses, data
spaces and knowledge in order to accelerate the formation processes of the modern
information society», as well as of the Department of Automation and
Information-Measuring Technique of Vinnytsia National Technical University within the
Research Center of Applied and Computational Linguistics. The research
was carried out within the framework of the state budget research works on the
topics "Development of methods, algorithms and software for modeling, designing and
optimization of intellectual information systems based on Web technologies «WEB»"
and "Intelligent information technology of image analysis of text and synthesis of
integrated knowledge base language content". Scientific research was also carried out
within the framework of the initiative topics of the ISM Department of Lviv
Polytechnic National University on the development of intelligent distributed systems
based on an ontological approach to integrating information resources.</p>
      <p>
        Analysis of Recent Research and Publications
Text content (article, commentary, book, etc.) contains a considerable amount of data
in natural language, some of which is abstract [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1-7</xref>
        ]. The text is presented as a unified
sequence of character units, the main properties of which are information, structural
and communicative connectivity / integrity, which reflects the content / structure of
the text [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref20 ref21 ref22 ref8 ref9">8-22</xref>
        ]. One method of text processing is linguistic content analysis of, for example,
comments, forums, articles, etc. [
        <xref ref-type="bibr" rid="ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref30">23-30</xref>
        ]. Text processing divides the
content into tokens using finite-state machines (Fig. 1).
As a functional-semantic-structural unity, the text conforms to rules of
construction and reveals regularities of content and the formal connection of its constituent
units. Cohesion is manifested through external structural indicators and the formal
dependence of the text components, and integrity through thematic, conceptual and modal
dependence. Integrity leads to the meaningful and communicative organization of the text,
and cohesion to its form and structural organization. Therefore, it is proposed to analyze
the multilevel structure of content as: a linear sequence of characters; a linear
sequence of morphological structures; a linear sequence of sentences; and a network of
interconnected unities (Alg. 1).
      </p>
      <p>Algorithm 1. Linguistic analysis of textual content.</p>
      <p>Stage 1. Grammatical (graphemic) analysis of textual content С1.</p>
      <p>Step 1. Divide textual commercial content С1 → С2 into sentences and paragraphs.</p>
    </sec>
    <sec id="sec-2">
      <title>Step 2. Divide the content character chain С2 into words.</title>
      <p>Step 3. Allocate numbers, dates, fixed expressions, and abbreviations in the content С2.</p>
    </sec>
    <sec id="sec-3">
      <title>Step 4. Select the non-text characters of content С2.</title>
      <p>Step 5. Form and analyze the linear sequence of units of the content С2.</p>
      <p>Stage 2. Morphological analysis of textual content С2.</p>
      <p>Step 1. Obtain the word stems (word forms with affixes cut off).</p>
      <p>Step 2. A grammatical category is formed for each word form (a set of
grammatical meanings: gender, number, case).</p>
      <p>Step 3. Form the linear sequence of morphological structures.</p>
      <p>Stage 3. Syntactic analysis α4 : ⟨С2, UK, T⟩ → С3 of textual content С2.</p>
    </sec>
    <sec id="sec-4">
      <title>Stage 4. Semantic analysis of textual content С3.</title>
      <p>Step 1. Match each word to semantic classes in the dictionary.</p>
      <p>Step 2. Select the morphosemantic alternatives required for this context.
Step 3. Combine the words into a single structure.</p>
      <p>Step 4. Generate an ordered set of superpositions of the basic lexical
functions and semantic classes. The accuracy of the results is supported by a
frequency/corrective dictionary.</p>
      <p>Stage 5. Referential analysis for interphrase unities.</p>
      <p>Step 1. Contextual analysis of textual commercial content С3. It resolves
local references (this one, that one, his) and identifies the utterance that forms the
kernel of a unity.</p>
      <p>Step 2. Thematic analysis. Separating statements into theme and rheme
yields thematic structures that are used, for example, when forming a digest.</p>
      <p>Step 3. Determine the regular repetition, synonymization and re-nomination of
keywords; the identity of reference, that is, the relation of words to the subject of the
image; and the presence of implications based on situational connections.</p>
      <p>Stage 6. Structural analysis of textual content С3. The prerequisites for its use are a
high degree of coincidence between the terms unity, discursive unit, sentence in a
semantic language, utterance, and elementary discursive unit.</p>
      <p>Step 1. Identify the basic set of rhetorical connections between content unities.</p>
      <p>Step 2. Build a nonlinear network of unities. The openness of the link set allows
its extension and adaptation for analyzing the structure of the text С3.</p>
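      <p>The six stages of Algorithm 1 can be summarized as a processing pipeline. The sketch below is only a schematic illustration (toy transformations and hypothetical function names, not the authors' implementation), showing how С1 is successively transformed towards С3:</p>
      <preformat>
```python
# Schematic skeleton of Algorithm 1: C1 is the raw text, C2 the token and
# morpheme level, C3 the parsed level (stages 3-6 are left as stubs here).

def graphemic_analysis(c1):
    """Stage 1 (toy): split content into word and punctuation tokens."""
    return c1.replace(".", " .").split()

def morphological_analysis(c2):
    """Stage 2 (toy): pair every word form with a crudely cut stem."""
    return [(w, w.rstrip("иіаяюєї")) for w in c2]

def linguistic_pipeline(c1):
    """Run the stages in order; stages 3-6 (syntactic, semantic,
    referential, structural) would further transform C2 into C3."""
    c2 = graphemic_analysis(c1)
    return morphological_analysis(c2)
```
      </preformat>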
      <p>Let us consider in detail each of the stages of the proposed algorithm.</p>
      <p>Stage 1. Grammatical (graphemic) analysis of textual content. A grapheme is
the minimal meaningful unit of written text. The objective of this level of recognition
is to build a formalized representation of the grapheme structure of the text and to
develop a formal apparatus for separating and classifying text units over the sets of lines
and graphemes. The generalized recognition algorithm works with certain restrictions on
the input text: it is formatted to a fixed width; it does not contain hyphenation; it does not
contain objects such as tables, figures, formulas or graphic symbols; and it is submitted in
known languages, such as English, Ukrainian, and German, rather than Ancient Egyptian,
Mongolian, or Elven. The ultimate goal of recognition at the graphemic level of text
representation is to build the grapheme structure of the text, which includes separating,
over the plurality of input lines and graphemes, such semantically independent units of
text as fragments (discourses), sentences, syntagms and tokens, and defining the types
(classes) of the enumerated units of text and the relationships between them in the specific
input text. The process of recognition at the grapheme level of the text representation
involves two stages, as shown in Fig. 2.</p>
      <p>[Fig. 2 schematic: the input data are the input text information (lines, characters);
the reference models comprise a row classifier and a character classifier; the dictionaries
are a dictionary of names, a dictionary of geographical names, and a glossary of
abbreviations; the software modules are the procedures of graphemic analysis, of
pragmatic analysis, and of data formation for morphological analysis; the output data are
the marked text, fragments, tokens (language, non-linguistic, conventionally linguistic)
and the grapheme structure: fragments, sentences, syntagms, tokens, relations.]</p>
      <p>Fig. 2. Structural-logical scheme of recognition of knowledge from the subject area at the
graphemic stage of textual information analysis</p>
      <p>
The purpose of the first stage is to separate substantively distinct fragments in the
text and tokens in each fragment, and to determine the language of the input text
and/or of its fragments. The input of the first stage is the current text file and the a
priori reference models of lines and graphemes. The string classifier includes the
following significant classes: empty string (EmpStr), full string (FulStr), incomplete
right (IncRgt), incomplete left (IncLgt), symmetric incomplete (SmtInc). The rules for
recognizing lines in the text are given in Table 1. The set of reference models of
graphemes is conveniently presented in Backus-Naur form (BNF); the corresponding
abbreviations are given in Table 2. Consider the grammar G = &lt; V, T, S, P &gt;,
where the alphabet is V = &lt; Gr, T &gt;; terminal symbols are T : = &lt; A, B, C, D, E, F, G,
H, I, J, K, L, M, N, O, Q, P, R, S, T, U, W, V, X, Y, Z, Ä, Ö, Ü, Ą, Ć, Ę, Ł, Ń, Ó, Ś,
Ź, Ż, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, w, v, x, y, z, ä, ö, ü, ß, ą, ć, ę, ł,
ń, ó, ś, ź, ż, А, Б, В, Г, Д, Е, Ж, З, И, Й, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х, Ц, Ч, Ш,
Щ, Ь, Ю, Я, Є, І, Ї, Ґ, Ы, Э, Ъ, а, б, в, г, д, е, ж, з, и, й, к, л, м, о, н, п, р, с, т, у, ф,
х, ц, ч, ш, щ, ь, ю, я, є, і, ї, ґ, ы, э, ъ, ‘&gt; and a list of tuples for the elements of the set
of reference models of graphemes defined in Table 2:
1. Gr : = &lt; Sb ∪ Sp &gt; is the recognized text content as a set of characters and spaces;
2. Sp : = &lt;_&gt; is the space as a terminal symbol;
3. Sb : = &lt; Ltr ∪ Dgt ∪ Ssb ∪ Ssg &gt; is the sets of letters, digits, special characters, and
syntactic characters;
4. Ltr : = &lt; Lat ∪ Cyr ∪ Eng ∪ Ger ∪ Pol ∪ Ukr ∪ Rus ∪ Cpl ∪ Sml ∪ Cnl ∪ Vwl ∪ ‘ &gt; is the set of
Latin, Cyrillic, English, German, Polish, Ukrainian, and Russian letters, including the
uppercase and lowercase letters of the respective languages, consonants and vowels,
as well as the apostrophe as a terminal symbol;
5. Dgt : = &lt; 0 ∪ 1 ∪ 2 ∪ 3 ∪ 4 ∪ 5 ∪ 6 ∪ 7 ∪ 8 ∪ 9 &gt; is the set of digits;
6. Ssb : = &lt; Osb ∪ Bsb ∪ Msb &gt; is the set of service characters, brackets and mathematical
symbols;
7. Ssg : = &lt;«»“”,.:;-?!&gt; is the set of terminal syntactic characters;
8. Cpl : = &lt; Lcp ∪ Ccp ∪ Ecp ∪ Gcp ∪ Pcp ∪ Ucp ∪ Rcp &gt; is the set of capital letters
of the respective languages;
9. Sml : = &lt; Lsm ∪ Csm ∪ Esm ∪ Gsm ∪ Psm ∪ Usm ∪ Rsm &gt; is the set of lowercase
letters of the respective languages;
10. Lat : = &lt; Lcp ∪ Lsm &gt; is the set of Latin letters, both capital and small;
11. Cyr : = &lt; Ccp ∪ Csm &gt; is the set of Cyrillic letters, both capital and small;
12. Eng : = &lt; Ecp ∪ Esm &gt; is the set of English letters, both capital and small;
13. Ger : = &lt; Gcp ∪ Gsm &gt; is the set of German letters, both capital and small;
14. Pol : = &lt; Pcp ∪ Psm &gt; is the set of Polish letters, both capital and small;
15. Ukr : = &lt; Ucp ∪ Usm &gt; is the set of Ukrainian letters, both capital and small;
16. Rus : = &lt; Rcp ∪ Rsm &gt; is the set of Russian letters, both capital and small;
17. Osb : = &lt;№%/@#$&amp;*\&gt; is the set of terminal service characters;
18. Bsb : = &lt; [ ∪ ] ∪ { ∪ } ∪ ( ∪ ) &gt; is the set of terminal bracket characters;
19. Msb : = &lt; + ∪ &lt; ∪ &gt; ∪ = &gt; is the set of terminal mathematical symbols;
20. Cnl : = &lt; Ecc ∪ Esc ∪ Gcc ∪ Gsc ∪ Pcc ∪ Psc ∪ Ucc ∪ Usc ∪ Rcc ∪ Rsc &gt; is the
set of uppercase and lowercase consonant letters of the respective languages;
The following production rules are offered to recognize the language of the text:
P := &lt; S → S Gr, S → Gr, Gr → Gr Sb, Gr → Gr Sp, Gr → Sb, Gr → Sp, Sp → _, Sb → Ltr,
Sb → Dgt, Sb → Ssb, Sb → Ssg, Ltr → Lat, Ltr → Cyr, Ltr → Eng, Ltr → Ger, Ltr → Pol,
Ltr → Ukr, Ltr → Rus, Ltr → Cpl, Ltr → Sml, Ltr → Cnl, Ltr → Vwl, Ltr → ‘, Ssb → Osb,
Ssb → Bsb, Ssb → Msb, Cpl → Lcp, Cpl → Ccp, Cpl → Ecp, Cpl → Gcp, Cpl → Pcp, Cpl → Ucp,
Cpl → Rcp, Sml → Lsm, Sml → Csm, Sml → Esm, Sml → Gsm, Sml → Psm, Sml → Usm, Sml → Rsm,
Lat → Lcp, Lat → Lsm, Cyr → Ccp, Cyr → Csm, Eng → Ecp, Eng → Esm, Ger → Gcp, Ger → Gsm,
Pol → Pcp, Pol → Psm, Ukr → Ucp, Ukr → Usm, Rus → Rcp, Rus → Rsm, Lcp → Lcc, Lcp → Lcv,
Lcp → Q, Lcp → V, Lcp → X, Lsm → Lsc, Lsm → Lsv, Lsm → q, Lsm → v, Lsm → x, Ccp → Ccc,
Ccp → Csv, Ccp → Ь, Ccp → Й, Csm → Csc, Csm → Csv, Csm → ь, Csm → й, Ecp → Lcc, Ecp → Lcv,
Ecp → Q, Ecp → V, Ecp → X, Esm → Lsc, Esm → Lsv, Esm → q, Esm → v, Esm → x, Gcp → Lcc,
Gcp → Lcv, Gcp → Ä, Gcp → Ö, Gcp → Ü, Gcp → Q, Gcp → V, Gcp → X, Gsm → Lsc, Gsm → Lsv,
Gsm → ä, Gsm → ö, Gsm → ü, Gsm → ß, Gsm → q, Gsm → v, Gsm → x, Pcp → Lcc, Pcp → Lcv,
Pcp → Ą, Pcp → Ć, Pcp → Ę, Pcp → Ł, Pcp → Ń, Pcp → Ó, Pcp → Ś, Pcp → Ź, Pcp → Ż,
Psm → Lsc, Psm → Lsv, Psm → ą, Psm → ć, Psm → ę, Psm → ł, Psm → ń, Psm → ó, Psm → ś,
Psm → ź, Psm → ż, Ucp → Ccc, Ucp → Ccv, Ucp → Є, Ucp → І, Ucp → Ї, Ucp → Ґ, Usm → Csc,
Usm → Csv, Usm → є, Usm → і, Usm → ї, Usm → ґ, Rcp → Ccc, Rcp → Ccv, Rcp → Ы, Rcp → Э,
Rcp → Ъ, Rsm → Csc, Rsm → Csv, Rsm → ы, Rsm → э, Rsm → ъ, Lcc → B, Lcc → C, Lcc → D,
Lcc → F, Lcc → G, Lcc → H, Lcc → J, Lcc → K, Lcc → L, Lcc → M, Lcc → N, Lcc → P, Lcc → R,
Lcc → S, Lcc → T, Lcc → W, Lcc → Z, Lcv → A, Lcv → E, Lcv → I, Lcv → O, Lcv → U, Lcv → Y,
Lsc → b, Lsc → c, Lsc → d, Lsc → f, Lsc → g, Lsc → h, Lsc → j, Lsc → k, Lsc → l, Lsc → m,
Lsc → n, Lsc → p, Lsc → q, Lsc → r, Lsc → s, Lsc → t, Lsc → w, Lsc → x, Lsc → z, Lsv → a,
Lsv → e, Lsv → i, Lsv → o, Lsv → u, Lsv → v, Lsv → y, Ccc → Б, Ccc → В, Ccc → Г, Ccc → Д,
Ccc → Ж, Ccc → З, Ccc → К, Ccc → Л, Ccc → М, Ccc → Н, Ccc → П, Ccc → Р, Ccc → С, Ccc → Т,
Ccc → Ф, Ccc → Х, Ccc → Ц, Ccc → Ч, Ccc → Ш, Ccc → Щ, Csv → А, Csv → Е, Csv → И, Csv → О,
Csv → У, Csv → Ю, Csv → Я, Csc → б, Csc → в, Csc → г, Csc → д, Csc → ж, Csc → з, Csc → к,
Csc → л, Csc → м, Csc → н, Csc → п, Csc → р, Csc → с, Csc → т, Csc → ф, Csc → х, Csc → ц,
Csc → ч, Csc → ш, Csc → щ, Csv → а, Csv → е, Csv → и, Csv → о, Csv → у, Csv → ю, Csv → я &gt;.</p>
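      <p>The terminal sets of the grammar can be applied directly in code. The following sketch (an illustrative assumption, not the authors' implementation; the set names follow the abbreviations above) classifies characters into terminal sets and guesses the script of a text fragment:</p>
      <preformat>
```python
# Map characters to terminal sets (Dgt, Ssg, Cyr, Lat) and guess the
# script of a fragment by majority vote over letter classes.
DGT = set("0123456789")
SSG = set("«»“”,.:;-?!")
UKR_ONLY = set("єіїґЄІЇҐ")  # letters that mark Ukrainian within Cyrillic

def classify_char(ch):
    """Map a character to one of the grammar's terminal sets."""
    if ch in DGT:
        return "Dgt"
    if ch in SSG:
        return "Ssg"
    if ord(ch) in range(0x0400, 0x0500) or ch in UKR_ONLY:
        return "Cyr"
    if ch.isalpha():
        return "Lat"
    return "Ssb"  # remaining service characters

def guess_script(fragment):
    """Return 'Ukr' if Ukrainian-specific letters occur, else the majority script."""
    if any(ch in UKR_ONLY for ch in fragment):
        return "Ukr"
    classes = [classify_char(ch) for ch in fragment if ch.isalpha()]
    return max(["Cyr", "Lat"], key=classes.count)
```
      </preformat>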
      <p>These production rules are used to identify meaningful units of analysis ⟨UC, UG⟩ of
textual commercial content X (phrase, sentence, theme, idea, author, character, social
situation, part of the text clustered into a category of analysis) (STAGE 1,
parsing based on the language of the text fragments) by a modified Porter algorithm
(STAGE 2, stemming). The requirements for choosing a linguistic unit of analysis are
the following: large enough to be meaningful in interpretation; small enough not to carry
many meanings; easily identified; and numerous enough to form a sample.</p>
      <p>
        Stage 2. Morphological analysis of textual content consists in finding the stems of
words: for example, the algorithm of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] cuts off suffixes, prefixes, etc., leaving only the stem of the word
(stemming). There are known algorithms for finding stems; for example, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] cuts
off suffixes, prefixes, etc., leaving only the stem of the word. Such algorithms also cut out
keywords with a simple word-selection function; each word is then reduced to its
stem and written into a table, for example a table of keywords. However, there is a
disadvantage: all the rules of word formation in
the Ukrainian language must be taken into account (inflections depending on gender and
declension, parts of speech, suffixes, prefixes, alternations in the word stem,
singular and plural, etc.). For example, for words from the set M = {пошуковими,
користувачам, високорейтингового, рейтингу}, such algorithms do not work (the
highlighted fragments show the reason: they were not covered by the rules). Increasing
the number of rules geometrically increases the processing workload; for
example, the task of checking and defining keywords for 100 articles a day requires
passing every word through the checks of inflections, suffixes, etc., so the complexity of
the algorithm grows to a critical limit. For English-language texts the complexity is lower:
there are only two cases and one noun ending. For German the complexity already increases:
four cases, and compound words written together from 2, 3 or more words. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the algorithm works for L = {Автомат – Автомат, Автомата –
Автомат, Автоматом – Автомат, Ресурсів – ресурс}. But it is better not to find
the stem by excessive cutting, but rather, given thematic dictionaries of
keyword stems, to find these stems in the text, their distribution (more
at the beginning, at the end, or in the middle of the text), and their frequency of use
relative to the total volume, and then to compute statistics over the stems by counting
the number of identical stems. There is a well-known algorithm for English-language
texts, Porter's stemmer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], but for Ukrainian texts it does not work perfectly.
      </p>
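      <p>The dictionary-based alternative described above can be sketched as follows (a hypothetical illustration, not the authors' implementation): given a thematic dictionary of keyword stems, find the stems in the text and compute their relative frequency and positional distribution:</p>
      <preformat>
```python
# Instead of cutting endings, look up known keyword stems in the text and
# compute each stem's frequency and positional distribution.
def stem_statistics(text, stem_dictionary):
    words = text.lower().split()
    total = len(words)
    stats = {}
    for i, word in enumerate(words):
        for stem in stem_dictionary:
            if word.startswith(stem):
                entry = stats.setdefault(stem, {"count": 0, "positions": []})
                entry["count"] += 1
                # 0.0 means the start of the text, 1.0 the end
                entry["positions"].append(i / max(total - 1, 1))
    for entry in stats.values():
        entry["frequency"] = entry["count"] / total  # relative to total volume
    return stats
```
      </preformat>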
      <p>
        Porter’s stemmer is a stemming algorithm published by Martin Porter in 1980.
The original version of the stemmer was designed for English and was written in BCPL.
Subsequently, Martin Porter created the Snowball project and, using the basic idea of the
algorithm, wrote stemmers for common Indo-European languages, including Russian
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17">10-17</xref>
        ]. The algorithm does not use stem dictionaries but, following a series of
rules, cuts off endings and suffixes based on the features of the language; it
therefore works quickly, but not always without errors. The algorithm was very popular
and was duplicated and often changed by different developers, not always successfully.
Around 2000, Porter decided to “freeze” the project and to continue distributing a single
implementation of the algorithm (in several popular programming languages) from
his site [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17">10-17</xref>
        ]. For example, in Ukrainian-language
texts this algorithm takes into account only the presence of an ending, not the suffixes;
as a result, some forms of the same word are identified as related while others are not.
The form of the flexion determines the type of
word, for example,
var $ADJECTIVE =
'/(ими|ій|ий|а|е|ова|ове|ів|є|їй|єє|еє|я|ім|ем|им|ім|их|іх|ою|йми|іми|у|
ю|ого|ому|ої)$/'; //http://uk.wikipedia.org/wiki/Прикметник +
http://wapedia.mobi/uk/Прикметник
var $PARTICIPLE = '/(ий|ого|ому|им|ім|а|ій|у|ою|ій|і|их|йми|их)$/';
//http://uk.wikipedia.org/wiki/Дієприкметник
var $VERB =
'/(сь|ся|ив|ать|ять|у|ю|ав|али|учи|ячи|вши|ши|е|ме|ати|яти|є)$/';
//http://uk.wikipedia.org/wiki/Дієслово
var $NOUN =
'/(а|ев|ов|е|ями|ами|еи|и|ей|ой|ий|й|иям|ям|ием|ем|ам|ом|о|у|ах|иях|ях|ы
|ь|ию|ью|ю|ия|ья|я|і|ові|ї|ею|єю|ою|є|еві|ем|єм|ів|їв|\'ю)$/';
//http://uk.wikipedia.org/wiki/Іменник
Features of the algorithm. The algorithm works with individual words, so the
context in which a word is used is unknown. Linguistic categories such as word
structure (root, suffix, etc.) and parts of speech (noun, adjective, etc.) are also not
available. The following techniques are currently used for analyzing words:
• The ending is cut off from the word; for example, cutting the ending увати turns the
word критикувати into критик.
• The word has an invariable ending. Words with this ending remain unchanged.
Example: ск and the invariable words блиск, тиск, обеліск and others.
• The word changes its ending. This rule applies to words in which certain letters
drop out during stemming (ядро and ядер: the ending ер changes to р) or change
(чоловік and чоловіче: к changes to ч).
• The word matches a regular expression. This is an attempt to combine
several rules into one complex rule. Perhaps this technique will not survive to the final
version of the algorithm, but for now the code contains expressions similar to: (ов)*,
ува(в|вши|вшись|ла|ло|ли|ння|нні|нням|нню|ти|вся|всь|лись|лися|тись|тися)
• The word does not change during stemming but is an exception to the
rules. This is an undesirable case for the algorithm, since it forces us to keep a dictionary
of exception words. Examples: віче, наче.
• The word changes during stemming but is also an exception. This is the worst case
for the algorithm because it forces two words to be stored in the dictionary at once:
the original and the stemmed one. For example, the word відер should be changed to
відр, although other words ending in ер are not treated this way (авіадиспетчер,
вітер, гравер, etc.).
• Short words remain unchanged. Service parts of speech (prepositions,
conjunctions, particles) are usually very short words, and words of up to 2 letters
inclusive are ignored by the algorithm.
      </p>
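      <p>The flexion-class regular expressions listed above can be transcribed into Python as follows (a sketch: the lightly deduplicated patterns and the function name are illustrative assumptions, not part of the original algorithm):</p>
      <preformat>
```python
import re

# Python transcription of the flexion-class regular expressions above,
# with exact duplicate alternatives removed.
ADJECTIVE = re.compile("(ими|ій|ий|а|е|ова|ове|ів|є|їй|єє|еє|я|ім|ем|им|их|іх|ою|йми|іми|у|ю|ого|ому|ої)$")
PARTICIPLE = re.compile("(ий|ого|ому|им|ім|а|ій|у|ою|і|их|йми)$")
VERB = re.compile("(сь|ся|ив|ать|ять|у|ю|ав|али|учи|ячи|вши|ши|е|ме|ати|яти|є)$")
NOUN = re.compile("(а|ев|ов|е|ями|ами|еи|и|ей|ой|ий|й|иям|ям|ием|ем|ам|ом|о|у|ах|иях|ях|ы|ь|ию|ью|ю|ия|ья|я|і|ові|ї|ею|єю|ою|є|еві|єм|ів|їв|'ю)$")

def inflection_classes(word):
    """Return the names of all flexion classes whose ending matches the word."""
    classes = []
    for name, rx in [("ADJECTIVE", ADJECTIVE), ("PARTICIPLE", PARTICIPLE),
                     ("VERB", VERB), ("NOUN", NOUN)]:
        if rx.search(word):
            classes.append(name)
    return classes
```
      </preformat>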
      <p>
        All of these techniques are applied by groups that form the rules of stemming. But
this significantly complicates the algorithm for finding keychains. Therefore, it is first
suggested to consider common endings - not traditional endings, as part of a word, but
the sequence of letters that end a word (Table 3-4). In the Table 3-4 endings of words
1 to 4 letters long are given. Five or more letters are not taken into account, since
there are not enough such words (for the maximum of 5 йтесь (6837), for 6 - ванням
(4656), etc.). This has created a kind of map for the project of stemming. The purpose
of the project is to build a static termination tree and to capture the algorithm of all
branches of the tree. Generally, a more detailed tree can be built [
        <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21 ref22">18-22</xref>
        ], but for
commercial content we choose a weighted level of detail - from 500 words with common
ending. Consider in more detail the idea of a Porter’s stemmer, namely finding the
basis of a word for a given source word [
        <xref ref-type="bibr" rid="ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref30">23-30</xref>
        ]. The algorithm does not use the bases
of words, but works consistently using a number of rules for truncating endings and
suffixes (Fig. 3).
      </p>
      <sec id="sec-4-1">
        <title>First, let's introduce some definitions:</title>
        <p> Vowels letters are а, е, і, ї, о, у, и, е, ю, я.
 RV is part of the word after the first vowel. It is empty if there are no vowels in the
word.
 R1 is part of the word after the first combination is vowel-consonant.
 R2 is part of R1 after the first combination is vowel-consonant.</p>
        <p>For example, in the word інформаційний: RV = нформаційний, R1 = формаційний,
R2 = маційний. Now let us define several classes of word endings, keeping the
names used in the original description of the algorithm.</p>
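        <p>The definitions of RV, R1 and R2 can be expressed directly in code (an illustrative sketch assuming the vowel set defined above):</p>
        <preformat>
```python
VOWELS = set("аеєіїоуиюя")  # Ukrainian vowel letters as defined above

def rv(word):
    """Part of the word after the first vowel (empty if there is no vowel)."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i + 1:]
    return ""

def r1(word):
    """Part of the word after the first vowel-consonant combination."""
    for i in range(len(word) - 1):
        if word[i] in VOWELS and word[i + 1] not in VOWELS:
            return word[i + 2:]
    return ""

# R2 is simply r1 applied to R1: r1(r1(word))
```
        </preformat>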
        <p>Class 1. PERFECTIVE GERUND
• Group 1: в, вши, вшися. The ending should be preceded by the letter а or я.
• Group 2: ив, ивши, ившися.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Class 2. ADJECTIVE</title>
        <p>а, е, і, и, ими, іми, ій, ий, їм, ім, им, ього, ого, ьому, ому, їх, их, ую, юю, ая, яя, ою, єю.
Class 3. PARTICIPLE
• Group 1: вш, юва, ува, уч, юч, л. The ending should be preceded by the letter а or я.
• Group 2: нн, н, ячи, ачи, ова, ову, єм.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Class 4. REFLEXIVE are ся, сь.</title>
        <p>Class 5. VERB
• Group 1: ла, є, єте, йте, ли, люю, й, в, єм, ємо, ний, ло, ть, но, ють, ні, ть,
єш. The ending should be preceded by the letter а or я.
• Group 2: ила, ела, ена, йте, ите, єте, юй, уй, їй, ай, ало, ив, или, имо, ений,
ило, їло, ено, ють, ать, ені, ять, іть, ить, иш, ую, ю.</p>
        <p>Class 6. NOUN are а, ев, ов, і, тя, е, ами, іями, ями, єї, єю, ями, ям, ії, и, ою, ій, ой, ий,
й, им, им, ім, ам, ом, о, у, ах, ях, ую, ю, ія, я.</p>
        <p>Class 7. SUPERLATIVE (найдовший, миліший, більший) are ш, іш.
Class 8. DERIVATIONAL (милість, щедрість, малість, крайність) is ість.
Class 9. ADJECTIVAL is defined as ADJECTIVE or PARTICIPLE + ADJECTIVE.
For example: падюча = пада + юч + а.</p>
        <p>Rules. When looking for an ending, the longest one is chosen. For example, in the
word інформація, ія must be chosen, not я. All checks are conducted on the part
RV. Thus, when checking for PERFECTIVE GERUND, the preceding letters а and я
must also be inside RV. Letters before RV do not take part in the checks at all.</p>
        <p>Step 1. Find an ending of class PERFECTIVE GERUND. If it exists, delete it
and complete the step. Otherwise, delete the REFLEXIVE ending (if it exists). Then,
in the following order, check for an ending and delete it if present: ADJECTIVAL,
VERB, NOUN. As soon as one of them is found, the step is completed.</p>
        <p>Step 2. If the word ends with і, delete і.</p>
        <p>Step 3. If after Step 2 there is a DERIVATIONAL ending, delete it.</p>
        <p>Step 4. One of three options is possible:</p>
      </sec>
      <sec id="sec-4-4">
        <title>1. If the word ends with н, delete the last letter. 2. If the word ends with SUPERLATIVE, delete it and delete the last letter again if the word ends with н. 3. If the word ends with ь, delete it.</title>
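        <p>The steps above can be sketched in code. The following simplified illustration implements only Step 1 (REFLEXIVE, then adjective and noun endings) with the longest-ending-first rule and the RV restriction; the class lists are abridged from the definitions above, and the function names are illustrative, not the authors' full implementation:</p>
        <preformat>
```python
# Simplified sketch of Step 1 of the Ukrainian stemmer: strip a REFLEXIVE
# ending, then the longest ADJECTIVE or NOUN ending that lies inside RV.
VOWELS = set("аеєіїоуиюя")
REFLEXIVE = ["ся", "сь"]
ADJECTIVE = ["ього", "ьому", "ими", "іми", "ого", "ому", "ій", "ий", "їм", "ім",
             "им", "їх", "их", "ую", "юю", "ая", "яя", "ою", "єю", "а", "е", "і", "и"]
NOUN = ["іями", "ями", "ами", "ові", "еві", "ах", "ях", "ам", "ом", "ія", "ів",
        "їв", "ії", "ою", "ій", "ой", "ий", "а", "е", "і", "и", "о", "у", "ю", "я", "й"]

def rv_start(word):
    """Index where RV begins: the position right after the first vowel."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i + 1
    return len(word)

def strip_longest(word, endings):
    """Remove the longest ending that matches and lies entirely inside RV."""
    region = word[rv_start(word):]
    for e in sorted(endings, key=len, reverse=True):
        if region.endswith(e):
            return word[:len(word) - len(e)], True
    return word, False

def stem_step1(word):
    for e in REFLEXIVE:
        if word.endswith(e):
            word = word[:len(word) - len(e)]
            break
    for cls in (ADJECTIVE, NOUN):
        word, removed = strip_longest(word, cls)
        if removed:
            break
    return word
```
        </preformat>
        <p>On the example from the rules above, the longest-ending rule correctly removes ія rather than я from інформація.</p>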
        <p>
          Stage 3. Syntactic analysis of textual content. Syntax is known to be a set of rules that
make it possible to construct formulas and to recognize correct formulas in a
sequence of characters. It is important for a symbolic computation system that all but
one of the logic operations in an expression are binary; the parser is built on this.
We consider the process of traversing the input sequence of characters in order to parse
its grammatical structure according to a given formal grammar. A parser is a
program, or part of a program, that performs parsing [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Generally (not just in the
computer industry), the term syntactic parsing means the breakdown of a text into parts
of speech, with the identification of their forms, purpose and syntactic relationships
with other parts. This largely depends on distinguishing
and positioning the parts of a particular language, which can be quite difficult to formalize
in inflected languages [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. It is not at all easy to parse sentences of such languages.
For example, there are significant ambiguities in the structure of human language, that
is, words and expressions that can in themselves convey meaning in a vast number of
variants, of which only one is relevant in a particular case. The success of
choosing the right meaning in the vast majority of cases depends on many factors of the
contextual content, and it is almost impossible to predict all combinations of meanings.
It is difficult to prepare formal rules for describing informal behavior, although, of
course, there are strict rules, many of which form the basis of the grammar underlying
the parser. During parsing, the text is framed into a data structure, usually
a tree, that matches the syntactic structure of the input sequence and is well suited for
further processing. As a rule, parsers work in two stages: the first identifies meaningful
tokens (lexical analysis is performed), and the second creates a parse tree, for example
(Fig. 4) for the arithmetic expression 1 + 2*3.
A token is a sequence of one or more characters that stands out as an atomic object.
The process of forming tokens is called tokenization or lexical analysis. Tokens are
distinguished on the basis of the basic rules of the lexical analyzer (or lexer), which
often differ depending on the scope [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Tokens are often classified by the position
(location) of characters in the character sequence or by context in the data stream. This
involves more than just extracting groups of characters delimited on either side by
spaces or punctuation. Tokens are defined by tokenization rules and include
grammatical elements of the language used in the data stream; in natural languages,
these are usually categories such as nouns, verbs, adjectives, or punctuation. The categories
are used in the further processing of tokens by a parser or other functions of the
program. The tasks of lexical analysis are as follows [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]:
– convert a character set into a token sequence;
– mark each token as a logical part of the text (a keyword, a variable name, a
punctuation mark, etc.);
– match each token with its token type and specific token text ("for", "variable", ";", etc.);
– extract additional token attributes (e.g. a variable's value);
– form an output token sequence to be used by the parser as input.
The lexical analyzer usually does nothing with combinations of the tokens it has
extracted. For example, a typical lexical analyzer recognizes a parenthesis as a token,
but does not check that each opening parenthesis "(" has a matching closing
parenthesis ")". That task remains for the parser.
        </p>
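The two parser stages described above can be illustrated with a minimal sketch: a lexer that turns the expression 1 + 2*3 into tokens, and a recursive-descent parser that builds a tree in which * binds tighter than +. This is hypothetical illustrative code, not the system's actual parser:

```python
# Stage 1: lexical analysis - turn a character sequence into tokens.
def tokenize(text):
    tokens = []
    for ch in text:
        if ch.isspace():
            continue
        if ch.isdigit():
            tokens.append(("NUM", int(ch)))
        elif ch in "+*":
            tokens.append(("OP", ch))
        else:
            raise ValueError("unexpected character: " + ch)
    return tokens

# Stage 2: parsing - build a tree reflecting operator precedence.
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos != len(tokens) else None
    def term():  # handles *, the higher-precedence operator
        nonlocal pos
        node = tokens[pos][1]; pos += 1
        while peek() == ("OP", "*"):
            pos += 1
            right = tokens[pos][1]; pos += 1
            node = ("*", node, right)
        return node
    def expr():  # handles +, the lower-precedence operator
        nonlocal pos
        node = term()
        while peek() == ("OP", "+"):
            pos += 1
            node = ("+", node, term())
        return node
    return expr()

# The tree groups 2*3 under +, as in the Fig. 4 example.
tree = parse(tokenize("1+2*3"))
```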
        <p>Step 4. Semantic analysis of textual content. Latent semantic analysis is a
method of processing natural-language information that makes it possible to analyze the
relationships between a collection of documents (messages, articles, i.e. textual
content) and the terms (keywords) that occur in them. It maps both documents and
terms to a set of factors (topics). First, words are correlated with semantic vocabulary
classes. Then the morphosemantic alternatives required for the given sentence are
selected. Next, words are linked into a single structure, forming an ordered set of
superpositions of basic lexical functions and semantic classes. The accuracy of the
result is determined by the completeness and correctness of the dictionary.</p>
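The document-term factorization at the heart of latent semantic analysis can be sketched with a singular value decomposition of a toy term-document count matrix. The matrix contents and the number of topics k below are illustrative assumptions, not the authors' data:

```python
# Minimal latent semantic analysis sketch: factor a term-document
# count matrix with SVD and keep k latent topics. Toy data only.
import numpy as np

# Rows = terms, columns = documents (illustrative counts).
X = np.array([
    [2, 0, 1, 0],   # term "content"
    [1, 0, 2, 0],   # term "keyword"
    [0, 3, 0, 1],   # term "grammar"
    [0, 1, 0, 2],   # term "parser"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of latent topics (assumed)
terms_k = U[:, :k] * s[:k]       # term coordinates in topic space
docs_k = Vt[:k, :].T * s[:k]     # document coordinates in topic space

def cos(a, b):
    # Cosine similarity between two topic-space vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Documents 1 and 3 share terms, so their topic vectors nearly coincide, while documents 1 and 2 share none and come out nearly orthogonal; this is the relationship between documents and terms that the step exploits.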
        <p>Step 5. Referential analysis for the formation of interphrase unities. A contextual
analysis of the textual content is carried out. It resolves local references (pronouns
such as this, that, his) and identifies the utterance that forms the kernel of a unity.
Thematic analysis follows. Dividing utterances into theme and rheme yields thematic
structures that are used, for example, in forming a digest. Regular repetition,
synonymization, and re-nomination of keywords are determined, along with referential
identity, i.e. the relation of the words to the object of reflection, and the presence of
implications based on situational connections.</p>
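The local-reference resolution mentioned above can be illustrated with a deliberately naive heuristic that links each pronoun to the nearest preceding noun. This is a toy sketch with English tokens for readability, not the authors' algorithm; real anaphora resolution also needs gender/number agreement and syntactic constraints:

```python
# Naive illustration of local reference resolution: link each pronoun
# to the nearest preceding candidate noun.
PRONOUNS = {"he", "she", "it", "this", "that"}

def resolve_references(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs, pos in {'NOUN', 'PRON', ...}."""
    resolved = {}
    last_noun = None
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos == "NOUN":
            last_noun = word
        elif pos == "PRON" and word.lower() in PRONOUNS and last_noun:
            resolved[i] = last_noun  # antecedent = nearest preceding noun
    return resolved

sentence = [("parser", "NOUN"), ("builds", "VERB"), ("tree", "NOUN"),
            ("it", "PRON"), ("is", "VERB"), ("processed", "VERB")]
```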
        <p>
          Step 6. Structural analysis of textual content. The prerequisites for its use are a
high degree of coincidence between a unity of terms, a discursive unit, a sentence in a
semantic language, an utterance, and an elementary discursive unit. A basic set of rhetorical
relations between content unities is identified and a nonlinear network of unities is
constructed [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1-7</xref>
          ]. The openness of the relation set allows it to be extended and adapted to the
analysis of text structure. There are several ways to use semantic analysis to
define keywords as phrases, that is, to identify among the set of words of the textual
content the terms that are nouns, noun phrases, or adjective-noun combinations. For
example, by the rules:
1. If the keyword is an adjective (the flexion of the word is ий, masculine), then all
the words used to the right of this adjective, in any case form, are found in the text (the
search is based on the stem of this adjective) and a frequency dictionary is built for
them. The phrases used more often than a certain limit (though possibly less often than
the adjective itself) become new keywords. The limit is set by the moderator.
2. If the keyword is a noun (the flexion of the word is not ий), then all words to the right
and to the left of it are analyzed.
3. First, all words to its left are checked for the flexion ий. A frequency
dictionary is again built, and the set of words occurring more often than a
moderator-defined limit is determined; these are the new keywords.
4. Then all words to the right are analyzed; they should all be without the flexion ий.
        </p>
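Rule 1 above can be sketched as a frequency-dictionary pass over the token stream. Matching case forms by stem prefix and the example tokens are simplifying assumptions; the full method searches by the adjective's morphological stem:

```python
# Sketch of rule 1: for a keyword adjective (flexion "ий"), count the
# words appearing immediately to its right and keep those used more
# often than a moderator-defined limit as new keyword phrases.
from collections import Counter

def phrases_for_adjective(tokens, adjective, limit):
    # Strip the flexion "ий" to get the stem (simplified stemming).
    stem = adjective[:-2] if adjective.endswith("ий") else adjective
    freq = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left.startswith(stem):   # any case form of the adjective
            freq[right] += 1
    # Phrases used more often than the limit become new keywords.
    return {word: n for word, n in freq.items() if n > limit}

tokens = ["інформаційний", "пошук", "та", "інформаційного", "пошуку",
          "інформаційний", "пошук", "даних"]
```

Note that the right-hand words are not themselves stemmed here ("пошук" and "пошуку" stay separate); merging their case forms would require the same stemming step applied to them as well.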
        <p>Similarly, a frequency dictionary is used to define the resulting set of keywords.</p>
        <p>Experimental studies. As the linguistic base for the experimental study of the
proposed method, 100 scientific publications from issues 783 and 805 of the Bulletin of
the Lviv Polytechnic National University, series "Information Systems and Networks"
(http://science.lp.edu.ua/sisn; http://science.lp.edu.ua/SISN/SISN2014, http://science.lp.edu.ua/sisn/vol-cur-805-2014-2), were selected. The statistics of the
system for detecting keyword sets in the 100 scientific articles were analyzed in two
stages:
1. Analysis of all articles against the common blocked words and the thematic vocabulary.
2. Analysis of all articles against the refined blocked words and the refined thematic
vocabulary (with each subsequent run, a set of unknown words, absent both from the
thematic dictionary and from the set of blocked words, is formed).</p>
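The two-stage check can be sketched as a filtering step that also collects the unknown words for moderator review; the word sets below are illustrative placeholders, not the system's dictionaries:

```python
# Sketch of the two-stage check: filter article words against a
# blocked-word (stop-word) set and a thematic dictionary, collecting
# unknown words (absent from both) for dictionary refinement.
def classify_words(words, blocked, thematic):
    keywords, unknown = [], set()
    for w in words:
        if w in blocked:
            continue                 # discard common blocked words
        if w in thematic:
            keywords.append(w)       # candidate keyword
        else:
            unknown.add(w)           # queue for moderator review
    return keywords, unknown

blocked = {"та", "і", "у", "на"}                 # placeholder stop words
thematic = {"контент", "аналіз", "стемінг"}      # placeholder thematic terms
kw, unk = classify_words(["контент", "та", "аналіз", "морфологія"],
                         blocked, thematic)
```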
        <p>In addition, at each stage the review was performed in two steps for each article:
analysis of the entire article (http://victana.lviv.ua/index.php/kliuchovi-slova) and
analysis of the article without its front matter (title, authors, editors, annotations in two
languages, authors' keywords in two languages, authors' affiliations) and without the
reference list, in order to determine the errors in the accuracy of keyword-set
formation (Fig. 5).
In Fig. 5a, the diagram of the statistics of forming the sets of all potential keywords, in
comparison with the sets defined by the authors of the articles, is presented.
The first column is the average number of keywords identified by the authors (4.77),
and the second is the average number of words making up those author keywords
(9.82). The third column is the arithmetic mean of the potential keywords defined by
the system in stage 1, step 1 (5.46); the fourth, in stage 1, step 2 (6.51); the fifth, in
stage 2, step 1 (7.43); the sixth, in stage 2, step 2 (8.35). Label these columns
A1-A6 accordingly. The value A3 differs from A1 by 0.69 (in quantity but not
in content); accordingly, A4 differs from A1 by 1.74; A5 from A1 by 2.66; and A6
from A1 by 3.58. The value A2 differs from A3 by 4.36; accordingly,
A2 from A4 by 3.31; A2 from A5 by 2.39; and A2 from A6 by 1.47. Therefore, on
average, the author of an article defines fewer keywords than are actually present in the
work. Adjusting the system parameters nearly doubles the number of defined keywords
(comparing A1 with A3, the ratio is 1.144654; with A4, 1.36478; with A5, 1.557652;
with A6, 1.750524). The total percentage increment of the value obtained by the
system, depending on the moderation of the dictionaries, is accordingly A3 - 14.46541;
A4 - 36.47799; A5 - 55.7652; A6 - 75.05241. Comparing A2 successively with A3-A6
(by how many times A2 is greater), we obtain the ratios 1.7985, 1.5084, 1.3217,
and 1.176.</p>
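The quoted ratios and percentage increments follow directly from the column averages A1-A6; the short check below recomputes them, for verification only:

```python
# Recompute the reported ratios and increments from the column averages.
A1, A2 = 4.77, 9.82                      # author keywords; words therein
A3, A4, A5, A6 = 5.46, 6.51, 7.43, 8.35  # system averages per stage/step

# Ratios of the system averages to A1 (how much more the system finds).
ratios_vs_A1 = [round(a / A1, 6) for a in (A3, A4, A5, A6)]

# Total percentage increment over A1, per dictionary moderation level.
increments_pct = [round((a - A1) / A1 * 100, 5) for a in (A3, A4, A5, A6)]

# How many times A2 exceeds each system average.
A2_over = [round(A2 / a, 4) for a in (A3, A4, A5, A6)]
```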
        <p>Figure 5b shows a chart with statistics of more detailed text features of the
analyzed articles, where 1 is the number of pages per article, 2 is paragraphs in the
article, 3 is lines of text, 4 is words, 5 is characters, 6 is characters including spaces,
7 is words per page, 8 is characters per page, and 9 is characters including spaces
per page.</p>
        <p>Figure 6 shows the distribution diagram of the sets of all potential keywords for
each of the 100 articles compared to the sets defined by the authors of the articles.</p>
        <p>Keyword accuracy is enhanced during the dictionary-moderation process. The
percentage difference between the number of keywords identified by the authors and
those defined by the system in stage 1, step 1 is 44.39919%; in stage 1, step 2 it
improves to 33.70672%; in stage 2, step 1 it improves significantly to 24.33809%; and
in stage 2, step 2 it is already only 14.96945%.</p>
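These percentages correspond to the differences between the system averages (5.46, 6.51, 7.43, 8.35) and A2 = 9.82 (the average number of words in the author keywords), taken relative to A2; that reading, inferred from the reported figures, can be checked:

```python
# The quoted accuracy percentages match the differences between
# A2 = 9.82 and the system averages, taken relative to A2
# (an interpretation inferred from the reported numbers).
A2 = 9.82
system_avgs = [5.46, 6.51, 7.43, 8.35]  # stage 1/step 1 .. stage 2/step 2
pct_diff = [round((A2 - a) / A2 * 100, 5) for a in system_avgs]
```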
        <p>In Table 5, the results of the analysis of the statistics of the sets of all potential
keywords for each article, compared to the sets defined by the authors of the articles,
are presented, where A denotes the authors' keywords, B the system-defined keywords
in stage 1 (step 1), C the system-defined keywords in stage 1 (step 2), D the
system-defined keywords in stage 2 (step 1), and E the system-defined keywords in
stage 2 (step 2). Tables 6-7 present the statistics of the analysis of the article texts
during keyword-set formation, used to construct the corresponding histograms for
groups A-E.</p>
        <p>(Histogram legend: author's keywords; system-defined, stage 1; system-defined, stage 2.)</p>
        <p>The author of a scientific article usually chooses, at his or her discretion, between
2 and 8 keywords (most often 3-5). The system likewise finds a varying number of
words, depending on the writing style of the particular author (there are articles in
which the system finds no keywords at all by Zipf's law). For group B, the system most
often determined 5, 7, or 3 keywords (more than 10 articles each), although the
distribution of found keywords ranged from 1 to 18 words (except 17). For group C,
the system likewise most often identified 5, 7, and 3 keywords over the same 1 to 18
range (except 17), but the number of keywords found increased and the highest
reliability was achieved. For group D, the system most often determined 7, 6, 5, 10,
or 8 keywords, although the distribution of found keywords ranged from 2 to 14 words
(a significantly narrowed range). For group E, the system most often determined 8, 5,
7, or 10 keywords, although the distribution of found keywords ranged from 3 to 16
words (improved accuracy).
</p>
        <p>The article presents theoretical and experimental substantiation of a method of
linguistic analysis of Ukrainian-language commercial content using Porter stemming.
The method is aimed at automatically detecting significant keywords of
Ukrainian-language content on the basis of the proposed formalization of the components of
analysis: grammatical (grapheme), morphological, syntactic, semantic, referential, and
structural. To implement grammatical analysis, rules for line recognition in text are
proposed, a set of standard graph models for 5 languages in Backus-Naur normal form
is determined, and the corresponding grammar G = &lt;V, T, S, P&gt; for identifying
meaningful units of textual commercial content analysis is constructed. Morphological
analysis was implemented by adapting M. Porter's stemming algorithm to the Ukrainian
language: in particular, a static ending tree was constructed, a weighted level of detail
was selected based on 500 words with common endings, and rules for the truncation of
endings and suffixes were substantiated. The basic requirements and procedures of
syntactic, semantic, referential, and structural analysis of Ukrainian-language
commercial content are defined. An experimental study of the linguistic analysis method was
conducted on 100 scientific publications from two issues (783 and 805) of the Bulletin
of the Lviv Polytechnic National University, series "Information Systems and
Networks" (http://science.lp.edu.ua/sisn). Based on the proposed method, the keyword
search system demonstrated the ability to improve itself by forming and refining the
set of common blocked words and the thematic dictionary with the participation of
moderators. It was found that, for the technical scientific texts of the experimental
base, the authors of articles usually define fewer keywords on average than are
actually present in the work. The statistics show that moderating the system's
dictionaries nearly doubles the number of defined keywords without compromising
accuracy or reliability. Further experimental research will require testing the proposed
method on keyword identification in other categories of texts: scientific humanities,
fiction, nonfiction, etc.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bobicev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanishcheva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherednichenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Sentiment Analysis in the Ukrainian and Russian News</article-title>
          .
          <source>In: First Ukraine Conference on Electrical and Computer Engineering</source>
          ,
          <fpage>1050</fpage>
          -
          <lpage>1055</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sharonova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doroshenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherednichenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Issues of fact-based information analysis</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          ,
          <volume>2136</volume>
          ,
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cherednichenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Babkova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanishcheva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Complex Term Identification for Ukrainian Medical Texts</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          ,
          <fpage>146</fpage>
          -
          <lpage>154</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Khomytska</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teslyuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holovatyy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morushko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Development of methods, models, and means for the author attribution of a text</article-title>
          .
          <source>In: Eastern-European Journal of Enterprise Technologies</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          -
          <fpage>93</fpage>
          ),
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Khomytska</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teslyuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Authorship and Style Attribution by Statistical Methods of Style Differentiation on the Phonological Level</article-title>
          .
          <source>In: Advances in Intelligent Systems and Computing III. AISC 871</source>
          , Springer,
          <fpage>105</fpage>
          -
          <lpage>118</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Babichev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>An Evaluation of the Information Technology of Gene Expression Profiles Processing Stability for Different Levels of Noise Components</article-title>
          .
          <source>In: Data</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <volume>48</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Babichev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Durnyak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pikh</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senkivskyy</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>An Evaluation of the Objective Clustering Inductive Technology Effectiveness Implemented Using Density-Based and Agglomerative Hierarchical Clustering Algorithms</article-title>
          .
          <source>In: Advances in Intelligent Systems and Computing</source>
          ,
          <volume>1020</volume>
          ,
          <fpage>532</fpage>
          -
          <lpage>553</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peleshchak</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basyuk</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kubinska</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chyrun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusyn</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pohreliuk</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Identifying Textual Content Based on Thematic Analysis of Similar Texts in Big Data</article-title>
          .
          <source>In: International Scientific and Technical Conference on Computer Science and Information Technologies (CSIT)</source>
          ,
          <fpage>84</fpage>
          -
          <lpage>91</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Moseіchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Porter stemming algorithm for Ukrainian languages</article-title>
          . http://www.marazm.org.ua/document/stemer_ua/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kubinska</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dilai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusyn</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pohreliuk</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chyrun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chyrun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodyak</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Method of Similar Textual Content Selection Based on Thematic Information Retrieval</article-title>
          .
          <source>In: International Scientific and Technical Conference on Computer Science and Information Technologies (CSIT)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>Russian stemming algorithm</article-title>
          . http://snowball.tartarus.org
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <article-title>Porter stemmer</article-title>
          . https://github.com/allaud/porter-stemmer
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>The Porter Stemming Algorithm</article-title>
          . http://tartarus.org/~martin/PorterStemmer/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Porter Stemming Algorithm. http://snowball.tartarus.org/algorithms/porter/stemmer.html</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <article-title>English stemming algorithm</article-title>
          . http://snowball.tartarus.org/algorithms/english/stemmer.html
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          :
          <article-title>An algorithm for suffix stripping</article-title>
          . http://telemat.det.unifi.it/book/2001/wchange/download/stem_porter.html
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Willett</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Porter stemming algorithm: then and now</article-title>
          . http://eprints.whiterose.ac.uk/1434/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Senyk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The Porter Stemming Algorithm for Ukrainian</article-title>
          . http://www.senyk.poltava.ua
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandes</surname>
            ,
            <given-names>V.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Emmerich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hrendus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Method for Determining Linguometric Coefficient Dynamics of Ukrainian Text Content Authorship</article-title>
          .
          <source>In: Advances in Intelligent Systems and Computing</source>
          ,
          <volume>871</volume>
          ,
          <fpage>132</fpage>
          -
          <lpage>151</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pukach</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nytrebych</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demkiv</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senyk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malanchuk</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sachenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalchuk</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huzyk</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian</article-title>
          .
          <source>In: Eastern-European Journal of Enterprise Technologies</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          -
          <fpage>96</fpage>
          ),
          <fpage>19</fpage>
          -
          <lpage>31</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hrendus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kubinska</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodyak</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Method of textual information authorship analysis based on stylometry</article-title>
          .
          <source>In: 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies</source>
          ,
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Kulchytskyi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Statistical Analysis of the Short Stories by Roman Ivanychuk</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          , Vol-
          <volume>2362</volume>
          ,
          <fpage>312</fpage>
          -
          <lpage>321</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Shandruk</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Quantitative Characteristics of Key Words in Texts of Scientific Genre (on the Material of the Ukrainian Scientific Journal)</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          , Vol.
          <volume>2362</volume>
          ,
          <fpage>163</fpage>
          -
          <lpage>172</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Lovins</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Development of a stemming algorithm</article-title>
          .
          <source>In: Mechanical Translation and Computational Linguistics</source>
          ,
          <volume>11</volume>
          :
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Jongejan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dalianis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike</article-title>
          . http://www.aclweb.org/anthology/P/P09/P09-1017.pdf
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanishcheva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hlavcheva</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Authorship Identification of the Scientific Text in Ukrainian with Using the Lingvometry Methods</article-title>
          .
          <source>In: Computer Sciences and Information Technologies</source>
          , CSIT,
          <fpage>34</fpage>
          -
          <lpage>38</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demchuk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Defining Author's Style for Plagiarism Detection in Academic Environment</article-title>
          .
          <source>In: Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing</source>
          , DSMP,
          <fpage>128</fpage>
          -
          <lpage>133</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Lytvyn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vysotska</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bobyk</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohirko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The linguometric approach for co-authoring author's style definition</article-title>
          .
          <source>In: Intelligent Data Acquisition and Advanced Computing Systems, IDAACS-SWS</source>
          ,
          <fpage>29</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <article-title>Hardcoded stemmer for Ukrainian</article-title>
          . https://github.com/vgrichina/ukrainian-stemmer
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Perestoronin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Porter Stemming Algorithm for Russian</article-title>
          . http://blog.eigene.in/post/49598738049/snowball
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>